INTRODUCTION 


In the six previous issues of the Review devoted to research methods, 
there has been a steady increase in the space devoted to statistical methods. 
In the first of the six issues (1934) the emphasis was almost entirely on 
the fields in which research might be applied rather than on methods, 
and the only discussion of statistics was found in two incidental passages 
amounting to about three pages, one of which gave the reasons why the 
writer thought statistical methods were declining in popularity. 

In the two decades since the publication of that issue the nature of 
statistical thinking has changed considerably, and its importance in re- 
search has greatly increased. In 1939 the issue included a 16-page chapter 
on statistical methods and a seven-page chapter on factor analysis. In 
1942 there was a 20-page chapter devoted to the application of new sta- 
tistical technics in experimental and statistical studies, one of eight pages 
on the statistical aspects of test development, and one of eight pages on 
applications of IBM machines to educational research, or 36 pages in 
all. In 1945 the issue contained a chapter with over six pages on research 
designs, one of about 17 pages on recent developments in statistical theory, 
and one of five pages on computational technics, or about 28 pages in all. 
It is of interest to note that the chapter on statistical theory included a 
section called “Problems of Statistical Inference in the Nonparametric 
Case.” In 1948, 43 pages were devoted to statistical chapters, and in 1951, 
82 pages. 

This development in previous issues indicated clearly that a separate 
issue on statistical methodology would soon be required. The literature 
on such topics as library research, tests and measurements, methods of 
making observations, and statistical methods has now become far too 
voluminous to be treated adequately in a single issue. Accordingly this 
issue deals with the statistical aspects of educational research and not 
with methods of gathering or presenting data. 

Several members of the committee responsible for preparing this issue 
had thought that the current literature contained numerous criticisms 
of the use of statistics in education and that a useful introductory chapter 
could present and analyze such criticisms. The editor and Dr. Sam Duker 
agreed to write this section, and Dr. Duker made an extensive search for 
appropriate material. He found many criticisms directed toward other as- 
pects of educational research, but surprisingly few directed toward the 
statistical analysis. Those comments found were often general rather than 
specific and did not cover a sufficiently broad range of topics to serve as 
a basis for a chapter. Consequently this plan was abandoned. This lack of 
trenchant published criticism might mean any of many things. It might 
be due to satisfaction with the methods customarily in use, either because 
these are generally excellent or because better methods are unfamiliar, 
or it might be due to professional courtesy toward one’s colleagues. A still 
more plausible explanation is that most authors of expository statistical 
articles have busied themselves with describing preferred methods and not 
with enumerating flaws. 

Statistical theory and method have developed in four great waves of 
ideas, in four periods, each of which has a very definite beginning but no 
ending. Each period made a great contribution to research methods. Some 
of the ideas and methods developed in each period have what at the present 
writing looks like timeless validity, so that they appear to be a permanent 
contribution to research method. Other ideas and methods have been super- 
seded by more inclusive or more basic ideas and more effective methods. 
Yet the research worker who has learned one set of concepts and methods 
often clings to these long after better ones are known. Sometimes one 
hears the charge that “educators are still living in the Pearsonian period 
of statistical thinking.” 

The first great wave of modern statistical activity and thought was in- 
augurated by the publication of Francis Galton’s Natural Inheritance in 
1889 and of Karl Pearson’s series of Mathematical Contributions to the 
Theory of Evolution beginning in 1893. This period was marked by (a) 
the new conviction that the analysis of statistical data might provide the 
answers for a vast variety of important questions regarding the physical 
universe and human life, (b) the collection of huge aggregates of data in 
quantity and diversity undreamed of before 1890, (c) the proliferation 
of new statistics together with a strong emphasis on the development of 
standard error formulas appropriate for large samples and on correlational 
technics, (d) the invention of chi-square, and (e) the attempt to char- 
acterize all frequency distributions by variants of a single differential 
equation. 

While chronologically the first studies in factor analysis occur in this 
period, that subject has always remained somewhat apart from the main 
stream of statistical thought, developing independently without much stimu- 
lation from the four waves of statistical activity described here. 

The second period began in 1915 when R. A. Fisher published his first 
paper on the exact distribution of the correlation coefficient. This period 
was marked by (a) the development of methods appropriate for use with 
small samples; (b) the search for exact sampling distributions as con- 
trasted with the previous general dependence on asymptotic standard error 
formulas; (c) the discovery of the dominating importance of certain 
of these exact distributions, namely, the previously known chi-square 
distribution and the newly discovered z, F, and t distributions, and the relation 
of each to the long familiar curve of normal probability; (d) the first 
formulation of logical principles for testing hypotheses and the design of 
experiments; (f) the invention of the technic of analysis of variance; (g) 
the development of criteria for choice among statistics used as estimators 
(efficient, consistent, sufficient, and maximum likelihood statistics) and the 
consequent slow disappearance of some previously popular statistics; (h) 
the extension of multivariate analysis beyond the already familiar multiple 
correlation; (i) the clarification of the notion of degrees of freedom; and 
(j) the introduction of the concept of fiducial probability. 

The third period began about 1928 with the publication of certain joint 
papers by Neyman and Egon Pearson. The number of persons in the field 
competent to develop new theory now increased greatly, and the important 
literature began a rapid expansion. The logic of statistical inference in 
this period was considerably changed by the introduction of such ideas 
as (a) the second kind of error, that is, the error of accepting a hypothesis 
when some other hypothesis is true; (b) the power of a statistical test; 
(c) the confidence interval (instead of the symmetric interval defined by 
an observed statistic plus or minus a multiple of its standard error used by 
Karl Pearson or the interval determined by the method of fiducial prob- 
ability used by Fisher) ; (d) theories and methods related to the selection 
of samples, especially for the purpose of surveys as contrasted with work 
on the design of experiments which was already well advanced; (e) se- 
quential analysis; and (f) the widespread application of statistical methods 
to quality control in industry. 

Apparently we are now moving into a fourth period, the lines of which 
are not yet wholly clear. It could well be argued that sequential analysis 
belongs in this period because it was the contribution of Wald whose 
creative genius is largely responsible for the reformulation of concepts 
which is now under way. However, I have placed it in the third period 
because its theory and applications are a direct consequence of the Ney- 
man-Pearson theory of testing hypotheses. The basic ideas of decision 
theory which may have a profound effect on statistical inference were 
developed by Wald shortly before his untimely death in 1950. While the 
general logic of statistical decision theory readily appeals to the practical 
person who must decide on a course of action, there are formidable dif- 
ficulties which still block its application in a concrete situation. In time, 
ways of overcoming these difficulties will probably be found. 

Clearly it is not now possible to present a review of the applications 
of statistical decision theory in education and related fields, for such ap- 
plications have not been made. However, the editorial committee for this 
issue of the Review considered that it would be a service to the field of 
educational research to outline here the general nature of the logic em- 
ployed in statistical decision theory, so that our readers might be con- 
versant with these general ideas which seem likely to exert a profound 
influence on research. Accordingly we requested a short exposition from 
the pen of a man who has made numerous important contributions to 
statistical theory and method and who is now creatively at work on decision 
theory. 

Because the chapter on Decision Theory is different in type from those 
customarily published in the Review of Educational Research, some 
explanation of the reason for its presence has seemed necessary in order 
that it may not be misunderstood as marking a change in the long-standing 
policy of this journal or setting a precedent for the inclusion of expository 
articles. 

HELEN M. WALKER, Chairman, 


Committee on Statistical Methodology 
in Educational Research 


CHAPTER I 


Sample Surveys in Education 


FRANCIS G. CORNELL 


The methodology reported in this chapter is of recent origin. Literature 
on the application of sampling theory to practical survey problems as 
we know it today was virtually unavailable as recently as 1940. Ten years 
ago the basic principles of the theory and the major operating technics of 
survey sampling were known only by a few experts. Theory and practice 
in the field covered by this chapter have been developed largely outside 
education in opinion polling, in market research, and in census operations. 
It is not surprising, therefore, that there is considerably more literature 
about survey sampling than there is about survey sampling in education. 

The wealth of understandable and technically adequate sources on sur- 
vey sampling which accumulated during the period covered in this issue of 
the Review justifies the devotion of a separate chapter to it. This chapter 
supplements a section of the Johnson and Moonan chapter (27) of a pre- 
vious issue which covered survey sampling thru the middle of 1951. 

Survey sampling methods apply to problems of enumeration (determin- 
ing how many or how much), particularly with reference to finite popula- 
tions. As such they differ considerably from common theory and methods 
in educational statistics, which deal with infinite populations and with 
sample methods in experimental design. Altho in recent years the statistical 
literature has included many items on sampling concerned primarily with 
sampling inspection (acceptance sampling), this area is excluded from the 
review of this chapter except as immediately applicable to survey sampling. 
Sampling surveys involving measurement, psychometric problems of item 
design, the measurement of attitudes and opinions, and scaling are not 
covered in this chapter. 


General References 


Instruction in educational statistics has not in general encouraged edu- 
cational researchers to discover and use advanced survey sample methods. 
This has been due at least in part to the conventional, classical models 
used in elementary textbooks in educational statistics. The appearance 
of the new text by Walker and Lev (56) may be indicative of a new trend 
in the statistical training of educational researchers. The Walker and Lev 
publication included a section on survey sampling, giving common methods 
of estimating means, totals, and proportions, using both stratified- and 
cluster-sampling designs with appropriate sampling variances. More im- 
portant, the text is inductive and nonmathematical in approach and builds 
many of the basic concepts essential to understanding modern sample design 
(e.g., sampling distribution, mathematical expectation, and interval 
estimation). 

Educators interested in survey sampling will find the new publications 
by Cochran (3) and by Hansen, Hurwitz, and Madow (21) welcome addi- 
tions to the earlier publications by Deming (15) and Yates (60) to com- 
plete a compact and relatively complete library on the subject. Cochran’s 
book will be found understandable and usable to persons familiar with 
elementary statistical topics at the level of simpler types of analysis of 
variance. Some calculus is required to follow some of the proofs. The 
Hansen, Hurwitz, and Madow publication appeared in two volumes, the 
first of which is exceptionally communicative, giving principles and meth- 
ods of sampling and their applications to various types of problems in a 
very readable, nonmathematical manner. The first three chapters of Volume 
I would be profitable to persons without statistical background who are in- 
terested in the ideas of sampling. The approach is intuitive; the illustra- 
tions are excellent. A good elementary course in statistics is probably all 
that would be required, with the exception of some calculus, to master 
not only the applications in Volume I, but also the proofs of theory pre- 
sented in Volume II. While the two volumes draw heavily upon the prob- 
ability sampling used in the U. S. census, the applications are appropriate 
to education. It is noted that populations, certain characteristics of which 
are of interest to educators, vary from pupils, parents, teachers, and tax- 
payers to school buildings, school buses, taxing districts, and words in 
the English language. The above sources contain good bibliographies bring- 
ing the bibliographical history of sample survey methods pretty much up 
to the period covered by this review. A brief but relatively complete state- 
ment on sampling methods was prepared by Kish (29) as one chapter 
in a volume by the Michigan Survey Research Center Group. 

Two additional items of general interest that appeared during the pe- 
riod make good nonmathematical reading and were prepared by very 
qualified writers on the subject. Papers by Cochran (2), Cornfield (8), 
and Hansen and Hurwitz (20) discussed modern methods in the sampling 
of human populations, principles in the selection of a sample, the de- 
termination of sample size, and area sampling in the local community. 
These papers resulted from a symposium on the subject held by the Ameri- 
can Public Health Association. A critique of sample methods used in the 
first Kinsey report by a highly qualified committee of the American 
Statistical Association (4, 5) is a valuable general discussion of survey 
sampling problems. The main text of the report (5) and an appendix 
on principles of sampling (4) contain skilful exposition of the alternatives 
an investigator has in the planning and designing of a survey. Some of the 
difficulties of probability and nonprobability samples are discussed. The 
problems are very similar to those encountered in educational surveys. 
Included is a discussion of nonsampling aspects of the survey such as 
interviewing, analytical technics, and interpretation. Of particular interest 
is a discussion of methods of “salvaging,” a good lesson on the advantages 
of avoiding “unplanned samples.” 

Of general interest is a section on surveys included by Mosteller (35) 
in a chapter reviewing statistical theory and research design in psychology. 
The importance of the subject to education was emphasized by Johnson 
(26) in an article on promising newer methods in educational research 
in a special research issue of the Phi Delta Kappan. He devoted approxi- 
mately a third of his lines to the subject of survey sample design. 


Nature of Survey Sample Design 


Applications of survey sampling methods may be judged in the context 
of the logic of problem solving. The more orderly the sequence of decisions 
for the securing of information about a population, the more scientific 
will be the type of thinking, and the more useful will be the sampling 
theory. This implies that whether sampling is to be done for purposes of 
administrative action or for purposes of research, sample design depends 
initially upon the specification of objectives and purposes of the survey. 
It is considerably easier to make use of statistical theory to the advantage 
of the surveyor if there is a clear specification of the population to be 
sampled, the particular information to be obtained (the measure or char- 
acteristic of the population), and the required precision of results. The sta- 
tistical theory of survey design is difficult to apply when these requirements 
are not met. For instance, if the objectives of a survey are as ambiguous 
as “to find out what the public in a school district thinks about the school 
system,” the statistician is at a loss to be specific in the determination 
of types of estimates to be used and the plan for selection of the sample. 
In the general references of the preceding section and elsewhere (1, 8), 
the necessity of joint use of statistical theory and the subjectmatter under 
investigation and a careful formulation of objectives of the survey have 
been emphasized. 

The general class of sample designs which permit selecting a sample plan 
which will yield required precision at minimum cost, or conversely at fixed 
cost estimates with maximum precision, is now commonly called prob- 
ability sampling. This is because the measurement of sampling errors on 
known mathematical models depends upon the knowledge of the prob- 
ability with which each individual is included in a sample as well as the 
particular sample plan. The advantages of probability sampling are that 
the precision of sample results may be evaluated objectively, and the ef- 
ficiency of various types of sample designs may be compared (3, 4, 21, 29). 
Probability samples have these characteristics: 


1. Each individual (or primary unit) in the population has some known 
probability of entering the sample. 

2. The process of sampling is automatic in one or more steps of the 
selection of elements or units in the sample. 

3. Weights appropriate to the probabilities in (1) are used in the 
analysis of the sample. 


Alternatives to probability sampling are variously labeled “unplanned,” 
“nonprobability,” or “judgment” samples (3, 4, 5), such as the following: 


1. The sample of convenience (e.g., the superintendent’s office is housed 
in the high school; the high-school teachers being convenient, he asks some 
of them their opinions on a matter) 

2. The canvass of experts (e.g., a questionnaire to several “informed” 
persons for judgment on teacher shortage or school construction needs in 
the United States) 

3. The sample based on an obsolete list or frame which does not ade- 
quately cover the population (e.g., using a city directory or telephone book 
as a basis for sampling the adult population of a community) 

4. The sample with a high proportion of nonresponse (e.g., the common 
questionnaire study in education) 

5. The pinpoint or representative-area sample (e.g., purposive selection 
of typical individuals, or a typical school, typical classroom, or typical 
community) 

6. The quota sample, by which there is some system of selection of pri- 
mary sampling units (such as communities), and assigning interviewers 
quotas for subsampling (e.g., an interviewer is asked to select for interview 
10 females who are high-school graduates between the ages of 18 and 25 
living in the northeast section of a city). 


The foregoing are all alike in that there is no basis for determining 
the best balance between cost (the use of resources, chiefly a consideration 
of the allocation of cases in the sample and sample size) and precision, and 
other aspects of design. They also have the common limitation that there 
is no suitable measure of precision because the probability that an in- 
dividual is included in the samples is not known. 

The simple random sampling ideas and formulas of elementary statistical 
texts are not sufficient guides to the sampler who is interested in efficient 
survey sample designs. The simple random (or unrestricted random) 
method consists of selecting n units out of the N units in the population 
such that each possible combination of n units has the same probability 
of being selected. It is not always possible to sample in this manner, and 
false conclusions are drawn if formulas for standard errors of means, 
totals, and proportions for unrestricted random sampling are applied 
to other types of samples. 
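
The definition above can be made concrete with a minimal sketch. It assumes a complete list of the N units is at hand and that the mean is the quantity estimated; the function names are hypothetical, and the standard error uses the usual finite population correction for sampling without replacement.

```python
import math
import random
import statistics


def simple_random_sample(units, n, seed=None):
    """Draw n of the N listed units without replacement, so that every
    possible combination of n units has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(list(units), n)


def srs_mean_and_standard_error(sample, population_size):
    """Estimate the population mean and its standard error for simple
    random sampling without replacement from a finite population."""
    n = len(sample)
    mean = statistics.fmean(sample)
    s2 = statistics.variance(sample)      # sample variance, divisor n - 1
    fpc = 1.0 - n / population_size       # finite population correction
    return mean, math.sqrt(fpc * s2 / n)
```

These formulas hold only for the unrestricted random design; applied to a stratified or cluster sample they would generally give a misleading standard error.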

There are several reasons why the simple random design is not always 
best. In the first place, other designs may yield more precise results for 
the same size, or conversely, the same precision with a smaller total sample. 
Moreover, lists of the N individuals in the population may not exist or may 
be incomplete. Adults in a community of any size are rarely listed. On 
the other hand, the county courthouse may have a reasonably complete 
list of registered voters which could be used in a sample on a school bond 
or school organization issue properly requiring a population of voters. 
Also, elements in the population may be duplicated in available lists. For 
instance, if the sampling element is the family, a list of all school children, 
if used in sample selection, would include some families in greater propor- 
tion than others because of the variation in numbers of children per family. 
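
The duplication problem in the last example can be offset in the analysis, as the following sketch suggests. It assumes, for illustration only, that children are drawn with equal probability and with a small sampling fraction, so that a family's chance of selection is roughly proportional to its number of listed children; the function name and the weighted-mean form are not taken from the sources reviewed.

```python
def family_mean_from_child_list(family_values, children_per_family):
    """Estimate a mean per family from families reached through a list
    of children, weighting each family by 1 / (number of listed children)
    to offset its greater chance of selection."""
    weights = [1.0 / c for c in children_per_family]
    weighted_sum = sum(w * y for w, y in zip(weights, family_values))
    return weighted_sum / sum(weights)


# Two sampled families: one with a single child, one with four children.
estimate = family_mean_from_child_list([100.0, 200.0], [1, 4])
```

Here the unweighted mean of the two families would be 150, whereas the weighted estimate gives the four-child family only a quarter of the weight of the one-child family.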


Alternative Sample Designs 


There is a large variety of probability sample designs for which it is not 
necessary for all samples to be equally probable or individuals sampled 
independently—alternatives to simple random sampling, the unrestricted 
random design. One type of such design is known as stratified-random 
sampling. According to this plan, a population of N units is divided into 
subpopulations or strata of N1, N2, ..., Nk units. Simple random sampling 
is then executed within each stratum. If, on the basis of judgment or some 
previous information, units may be thus classified so that strata are rela- 
tively homogeneous, and so that the variance between groups is large, 
stratification may reduce the sampling error of the estimate. As long as 
the sampling within strata is carried out by a random process and every 
individual in the population has a known probability of being drawn, the 
precision of the sample may be determined. 

The allocation of sample units among the strata may be proportional, in 
which case equal weightings are used in deriving estimates. Optimum allo- 
cation may be used to maximize efficiency of the sample or to minimize cost. 
Stratified random sampling is not always more efficient (lower sampling 
variance) than simple random sampling. If the allocation of the sample 
among strata is far from optimum, a stratified sample may have higher 
sampling variance. If cost per sampling unit is the same in all strata, there 
is little difference in the efficiency of optimum and proportional allocation 
(3, 21). 
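
The two allocations may be compared numerically with a short sketch. It assumes equal cost per unit in every stratum, in which case optimum allocation takes the form often called Neyman allocation (sample sizes proportional to the product of stratum size and stratum standard deviation), and it assumes that the stratum standard deviations are known in advance. The function names are hypothetical, and rounding to whole units is left to the user.

```python
def proportional_allocation(stratum_sizes, n):
    """Allocate n sample units in proportion to stratum sizes N_h."""
    total = sum(stratum_sizes)
    return [n * size / total for size in stratum_sizes]


def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n sample units in proportion to N_h * S_h
    (optimum allocation when cost per unit is equal in all strata)."""
    products = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    return [n * p / total for p in products]


def stratified_mean_variance(stratum_sizes, stratum_sds, allocation):
    """Sampling variance of the stratified estimate of the mean, with
    simple random sampling without replacement inside each stratum."""
    population_size = sum(stratum_sizes)
    variance = 0.0
    for size, sd, n_h in zip(stratum_sizes, stratum_sds, allocation):
        weight = size / population_size
        variance += weight ** 2 * (sd ** 2 / n_h) * (1.0 - n_h / size)
    return variance
```

How much the two allocations differ in a given survey depends on how unequal the stratum standard deviations are; the sketch allows that difference to be checked before the sample is drawn.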

Dalenius and Gurney (11) considered the problem of optimum stratifica- 
tion in practical surveys and showed that increasing the number of strata is 
not enough in itself to insure a decrease in sampling variance. Another 
paper by Dalenius (10) considered a type of design using two strata, all 
units of one stratum being included. This is a design of practical interest 
in the sampling of highly skewed distributions. A comparison was made of 
this design with the minimum variance (optimum) design. A stratification 
by community size in the state of Washington was reported by Showel (44). 

Stratification is usually combined with some form of cluster sampling. 
This involves the dividing of N elements of a population into groups (e.g., 
schools, classrooms, city blocks, areas) such that sampling units contain 
more than one element of the population. Ordinarily this is combined with 
sampling in stages. Sampling of clusters is combined with a subsampling 
of elements within clusters. A two-stage sample is illustrated by drawing 
classrooms (clusters of elements) for a sample of school children. Sampling 
counties, subsampling school systems, then subsampling schools or classrooms, 
and so on, would be multi-stage sampling. 

In general, cluster sampling results in loss of efficiency as compared with 
the simple random design. Only if the variance within clusters is greater 
than the total variance is cluster sampling more efficient than unrestricted 
random designs. This rarely happens. A serious underestimation of samp- 
ling variance is thus usually a consequence of the application of simple 
random formulas for error variance to samples actually drawn by this 
method. A very important principle in survey sample methods is that the 
sampling error formula depends on both the type of estimate used and the 
particular sample design. 
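
The underestimation described above can be seen in a small sketch. It assumes a single-stage sample of equal-size clusters in which every element of a selected cluster is observed, and it ignores the finite population correction; the function name and the data are hypothetical.

```python
import math
import statistics


def cluster_and_naive_standard_errors(clusters):
    """Return (se_cluster, se_naive) for the mean of a one-stage sample
    of equal-size clusters.

    se_cluster treats the cluster means as the sampled quantities.
    se_naive wrongly treats all elements as a simple random sample.
    """
    a = len(clusters)                     # number of sampled clusters
    m = len(clusters[0])                  # elements per cluster
    cluster_means = [statistics.fmean(c) for c in clusters]
    se_cluster = math.sqrt(statistics.variance(cluster_means) / a)
    elements = [x for c in clusters for x in c]
    se_naive = math.sqrt(statistics.variance(elements) / (a * m))
    return se_cluster, se_naive


# Three classrooms whose pupils are alike within rooms but differ between rooms.
se_cluster, se_naive = cluster_and_naive_standard_errors(
    [[1, 1, 1], [5, 5, 5], [9, 9, 9]]
)
```

In this extreme case the simple random formula gives a standard error only half as large as the one based on the cluster means.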

In addition to examples of multi-stage sampling in the general references 
noted above, Hansen and Hurwitz (20) reported an example of a dwelling- 
unit sample of a city where no adequate list of people exists. The authors 
showed steps involved in the use of subsampling and appropriate methods 
of estimation. A study of the application of various two-stage sample designs 
and estimating procedures in North Carolina was reported by Jebe (25). 
He compared methods of selecting primary units with equal probability 
and with probability proportional to size. He assessed the contribution of 
the within primary sampling unit error component to the total survey 
sampling error. 

A general method for dealing with samples without replacement from a 
finite population when variable probabilities of selection are used for the 
elements remaining prior to each draw was reported by Horvitz and 
Thompson (24). They give unbiased estimators and sampling variances 
of the estimators and apply their approach to one-stage and two-stage de- 
signs. Sampling without replacement and with unequal probability was also 
considered by Midzuno (33). 
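
The estimator of a population total associated with Horvitz and Thompson weights each sampled value by the reciprocal of its probability of inclusion in the sample. A minimal sketch follows, assuming those inclusion probabilities are known for the sampled units; the function name is hypothetical, and the variance estimator, which also requires joint inclusion probabilities for pairs of units, is not attempted here.

```python
def horvitz_thompson_total(values, inclusion_probabilities):
    """Estimate a population total as the sum of y_i / pi_i over the
    sampled units, where pi_i is the probability that unit i is included."""
    if len(values) != len(inclusion_probabilities):
        raise ValueError("one inclusion probability is needed per value")
    if any(not 0.0 < p <= 1.0 for p in inclusion_probabilities):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return sum(y / p for y, p in zip(values, inclusion_probabilities))


# Three sampled units with unequal chances of selection.
total = horvitz_thompson_total([10.0, 20.0, 30.0], [0.5, 0.25, 1.0])
```

A unit certain to be drawn (probability 1) counts only for itself, while a unit with one chance in four of selection stands for four units of the population.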

Double sampling entails sampling a population in two phases or in two 
parts so that information from the first part may be used in determining 
allocation and sample size for the second part. A comparison of double 
sampling and single sampling based upon a magazine readership survey 
was reported by Robson and King (38). Two other papers on double 
sampling appeared during the period. They both deal with a double samp- 
ling scheme which has been developed for estimates of the mean of a 
normal population and which has the advantage that the expected number 
of observations is less than the number required for the ordinary single 
sample plan. A further advantage of this scheme is that it guarantees that 
the confidence interval will be no larger than some stated amount. Owen 
(36) extended the theory underlying this approach summarized by Cochran 
(3: 61). In Owen’s two-stage sampling plan the size of the second part 
of the sample depends upon information supplied about the variance of the 
population by the first part of the sample. Seelbinder (43) examined the 
problem of selecting an optimum size for the first part of the sample. Cox 
(9) also reported on the use of a preliminary sample to determine how 
large the total sample should be. An application of sequential sampling 
which would be applicable in resurvey work, school census work, and 
various psychological problems related to education was reported by Good- 
man (18). These are applications of some of the ideas of sequential samp- 
ling which may find applications in education with types of populations 
which are difficult to list or for which advance information is not available. 

Systematic sampling has a very special meaning in statistics. A sample 
procedure may be very methodical, orderly, and systematic without being 
classifiable as systematic sampling. This term is used by statisticians when 
the process of selection is that of taking n units from the N units of the 
population, arranged in some order from 1 to N (as in a list or file), by 
taking every kth individual after a random start from 1 to k. This has been 
shown to be a special case of cluster sampling (3: 152). It is thus more 
precise than simple random sampling only if the variance within the 
systematic samples is larger than the population variance. It is really not a 
sample design but a method of selection of elements in sampling. It is used 
in stratified and cluster sampling designs and in various stages of multi- 
stage sample plans. The Madow studies on this subject are classic; the third 
in this series (31) considers conditions under which three sampling meth- 
ods—random-start systematic, centered systematic, and stratified random— 
are superior as compared with one another in efficiency. A good general 
treatment of this subject may be found in Chapter VIII of Cochran’s work 
(3). 
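
The selection procedure just described may be sketched as follows, assuming the population is held as an ordered list and the interval k is fixed in advance; the function name is hypothetical.

```python
import random


def systematic_sample(units, k, seed=None):
    """Take every kth unit of an ordered list after a random start
    chosen from positions 1 to k."""
    rng = random.Random(seed)
    start = rng.randint(1, k)             # random start, 1 to k inclusive
    return units[start - 1::k]


# A list of 100 units sampled at an interval of 10.
chosen = systematic_sample(list(range(1, 101)), 10, seed=3)
```

Only k distinct samples are possible under this procedure, one for each starting position, which is why it may be regarded as the selection of a single cluster from k clusters. When N is not a multiple of k the sample size varies by one unit according to the start.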

Sample design is complicated when the objectives of the sample consist 
of several measures or characteristics of the population. If advance infor- 
mation consisting of several measures exists about the population and this 
information is used in planning the sample, sample size may be determined 
on the “most important” measure, or sample size may be determined for 
each measure, and the largest sample taken. Moonan (34) contributed a 
method of making independent estimates. He reported appropriate estimates 
of means for different variates from the same sample, taking into account 
dependence of such estimates. 

Administrative considerations sometimes prevent random sampling within 
strata. Stratified designs which do not involve some kind of automatic 
random selection within strata are, by definition, not probability samples. 
The quota sample is widely used in market research and in opinion-polling 
organizations because of some of the difficulties in designing probability 
samples. There is thus some interest in the comparison of quota samples 
with probability samples. A conference of Gallup Institutes from 10 coun- 
tries reported that the appropriateness of probability or random sampling 
depends upon the nature of the problem being investigated; that a survey 
based on probability sampling is likely to have a failure rate between 15 
and 20 percent even after three recalls; that speed of interviewing is in- 
creased and failure rate decreased if interviewing is confined to evenings; 
that interviewing costs are three to five times greater in probability than 
in quota sampling; and that no divergence greater than 3 percent was 
found between a factual survey based on probability sampling and the 
survey repeated by quota sampling (16). On the same subject, Haner and 
Meier (19) reported that probability samples cost no more for actual 
field work, altho they do cost appreciably more for planning. They found 
probability samples yielded about the same answers as quota samples on 
opinions. Their report was based on a comparison of two area probability 
samples and two quota samples. 


Errors in Surveys 


The appropriate measurement of standard errors due to sampling is 
strategic in the making of interval estimates. Such errors may be computed 
for many different types of estimates (e.g., of total N by ratio estimates, 
regression estimates, mean-per-unit estimates, and the like) for various 
sample designs. There are some difficulties in that exact distributions of 
some types of estimates are biased or unknown. Approximation methods 
are often used in interval estimation on the basis of the knowledge that for a 
large class of finite populations the distribution of the sample mean tends 
to normality even if the sample fraction is high and sampling is without 
replacement (3). 
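
As an illustration only, and not taken from any study cited here, the standard error of a mean from a simple random sample drawn without replacement may be sketched as follows. The function names are hypothetical, and the multiplier 1.96 is an assumed choice that rests on the normal approximation just described.

```python
import math

def se_mean_without_replacement(sample, population_size):
    """Standard error of the sample mean under simple random sampling
    without replacement from a finite population of known size."""
    n = len(sample)
    mean = sum(sample) / n
    # Unbiased sample variance (divisor n - 1).
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
    # Finite population correction: shrinks the variance as n approaches N.
    fpc = 1 - n / population_size
    return math.sqrt(fpc * s2 / n)

def normal_interval(sample, population_size, z=1.96):
    """Approximate interval estimate: mean plus or minus z standard errors."""
    mean = sum(sample) / len(sample)
    se = se_mean_without_replacement(sample, population_size)
    return mean - z * se, mean + z * se
```

When the sample exhausts the population the correction factor is zero and the standard error vanishes, as it should for a complete enumeration.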

Tables and charts available to statisticians for purposes of determining 
confidence intervals of proportions apply to samples from an infinite popu- 
lation or to samples with replacement. Survey sampling, however, is most 
frequently made without replacement and often from small finite 
populations. Under these conditions, confidence interval estimates of P 
present a problem. Katz (28) applied three methods of interval estimation 
to three dissimilar cases. He concluded that an approximation method 
based on the chi-square distribution with correction for continuity agrees 
more closely with results of the exact (hypergeometric) method than an 
approximation method without the correction. Woodruff (59) reported a 
method which has been used in the U. S. Bureau of the Census for obtaining 
confidence intervals for medians and other position measures, using a 
principle that may be applied to any type of probability sample. It does 
not depend upon the assumption that the distribution is normal or is of any 
other special type. 
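
The exact (hypergeometric) approach to a proportion in a small finite population can be sketched by inverting two one-sided tests. This is a generic construction offered for illustration; it is not a reproduction of the methods of Katz, the function names are hypothetical, and the equal-tailed criterion is an assumption.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """Probability of k successes in a sample of n drawn without
    replacement from N units of which K are successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def exact_interval(x, n, N, alpha=0.05):
    """Equal-tailed interval for the population proportion P = K / N,
    given x successes observed in a sample of n from N units."""
    kept = []
    # K cannot be below x, nor above N minus the observed failures.
    for K in range(x, N - (n - x) + 1):
        lower_tail = sum(hypergeom_pmf(k, N, K, n) for k in range(0, x + 1))
        upper_tail = sum(hypergeom_pmf(k, N, K, n) for k in range(x, n + 1))
        # Keep every K that neither one-sided test rejects.
        if lower_tail > alpha / 2 and upper_tail > alpha / 2:
            kept.append(K)
    return min(kept) / N, max(kept) / N
```

For a complete enumeration (n equal to N) the interval collapses to the observed proportion, which tables prepared for infinite populations cannot reflect.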

Unfortunately, the sampler has many types of errors other than sampling 
error to take into account in his work. Errors arise from many sources such 
as nonresponse, errors of measurement, errors in the preparation of the 
estimates, errors due to the fact that the population characteristics change 
with time, and errors of processing (2). A case history of a survey, The 
Post-Enumeration Survey of the 1950 Census, by Marks, Mauldin, and 
Nisselson (32) directed attention to points at which decisions must be made 
on the basis of intuition because of gaps in the knowledge of survey tech- 
nics. It deals particularly with specialization of interviewers, checking ac- 
curacy of interviewing, design of the interview, timing, and selection and 
training of interviewers. Stock and Hochstim (45) reported on the reduc- 
tion of interviewer variability by such means as more careful training and 
supervision of interviewers, restricting the freedom of interviewers in 
selecting respondents, greater care in sample designs, increasing the number 
of interviewers to cancel out interviewer variance, and the selection of 
interviewers from among the population to be studied. 

One of the most disturbing classes of error and one which has received 
considerable study, whether the survey is by mail or by interview, is the 
error of nonresponse. This source of bias is often unknown; hence the 
consequences may be devastating to results of a survey. Methods of dealing 
with errors of nonresponse include increasing initial canvass and the call- 
back (recanvassing or reinterviewing nonresponses). Deming (13) de- 
veloped a plan that will produce a calculable variance, a calculable bias of 
nonresponse, a calculable cost, and the number of recalls required to reach 
a desired accuracy at minimum cost. His scheme deals with the allocation 
of effort between the initial sample and the call-backs. 

Recent information on the kind of bias due to nonresponse in an edu- 
cational mail survey was reported by Rothney and Mooren (40). Using 
the scheme of mailing again and repeatedly remailing, they managed 
complete coverage of 369 former students in a follow-up of four high schools 
in Wisconsin. They showed not only how much bias enters into nonresponse 
in a survey of this type, but also who responds and who does not. Early re- 
sponders tended to be females from unbroken homes living in urban areas 
who had been intensively counseled, who had above average rank in graduat- 
ing class, and who were above average in intelligence. 

Methods of processing survey data are of interest not only because of the 
kinds of errors which they may produce but also because of efficiency 
which may be gained thru careful processing plans. Klein and Morgan (30) 
reported results of alternative statistical treatments of data from the sur- 
veys of consumer finances conducted by the Survey Research Center, Uni- 
versity of Michigan, for the Board of Governors of the Federal Reserve 
System. They considered the weighting of observations by use of differential 
sampling rates and differential response rates, heteroscedasticity, and use 
of nonlinearly related variables. Voight and Kriesberg (55) reported valu- 
able experience of the U. S. Bureau of the Census with procedures to in- 
crease the efficiency of processing census sample surveys. Many survey 
designs involve the use of weightings within groups of observations (because 
of nonresponse or because of the sample design) in computing the estimate 
of the population parameter from the sample. A common occasion for this 
is weighting in the estimation procedure for stratified designs with dis- 
proportionate numbers of cases in strata. Roshwalb (39) compared arith- 
metic and card-duplication methods of weighting. 
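
The weighting described here can be illustrated with a small sketch. It assumes a stratified design with known stratum sizes and contrasts an arithmetic weighting with the duplication of records. The functions and data are hypothetical and are not taken from Roshwalb; the duplication version further assumes that each stratum size is an exact multiple of its sample size.

```python
def weighted_stratified_mean(strata):
    """strata: list of (stratum population size N_h, list of sampled values).
    Each stratum mean is weighted by its share of the population."""
    total_population = sum(size for size, _ in strata)
    return sum((size / total_population) * (sum(values) / len(values))
               for size, values in strata)

def duplicated_record_mean(strata):
    """Same estimate obtained by duplicating each record N_h / n_h times
    and then taking a plain mean of the enlarged file."""
    records = []
    for size, values in strata:
        copies = size // len(values)  # assumes N_h is a multiple of n_h
        for value in values:
            records.extend([value] * copies)
    return sum(records) / len(records)

def unweighted_mean(strata):
    """Plain mean of all sampled values, ignoring the design."""
    values = [v for _, sampled in strata for v in sampled]
    return sum(values) / len(values)

# Hypothetical data: the small stratum is heavily oversampled.
example = [(900, [10, 12]), (100, [50, 52, 54, 56])]
```

With these figures both weighted versions give 15.2, while the unweighted mean of the six observations is 39.0, which shows how far a disproportionate sample can mislead when weights are ignored.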


The U. S. Census and Education Surveys 


There are undoubtedly more applications of advanced probability sample 
designs to educational data by the U. S. Bureau of the Census than by all 
other agencies together.* Educational researchers interested in learning 
survey sampling methods will find that they can study designs in the con- 
text of educational data from this source. The general methods of attack 
are highly suggestive of types of studies which may be made within states 
or regions of the United States where sampling methods are applicable. 

In connection with the U. S. Census, Current Population Survey (includ- 
ing the Monthly Report on the Labor Force), a question on school enrol- 
ment is usually included in addition to the regular questions every October 
(49, 50, 51). Occasionally, information on school enrolment is collected 
in April. At times questions are also asked on educational attainment and 
illiteracy (50). A rather complete description of the design, essentially as 
it is now used in the foregoing sample surveys, appears in Volume I of 
Hansen, Hurwitz, and Madow (21: 559). 

During the 1950 Census, a 20-percent sample was used for obtaining 
information on school enrolment and educational attainment. Tabulations 
were made for a 20-percent sample and also for a 3 1/3-percent sample (54). 

The Governments Division of the Bureau obtains, as part of the public 
employment picture, data on school employment and payrolls by means 
of sample surveys. Annually for the month of October estimates are made 
on a state basis. Four times a year, for each of the three months in the 
preceding quarter, estimates are made on a national basis utilizing a sub- 
sample of the October (or state) sample (47, 48, 52). Financial data con- 
cerning revenues, expenditures, and indebtedness are obtained annually 
for fiscally independent school districts from the national sample. Expen- 
ditures for education by level of government are obtained from the national 
sample, which is supplemented by the inclusion of all cities over 25,000 in 
population (53). 

In the surveys of state and local-school employment and payrolls, the 
local-school employment component is derived from a sample plan which 
specifies a coefficient of variation of 2 percent for school districts. The 
“state sample” of school systems numbers 8804 units, of which 418 are 
public institutions of higher education and 8386 are local-school systems. 
For these estimates, the census includes all state institutions of higher edu- 
cation for each state. Complete card listings are made of all local-school 
systems, including state-administered local schools in North Carolina, Dela- 
ware, Maine, and certain other state-administered local schools in other 
states. The listings show enrolment size factors for each system. Sampling 
is based upon a stratification according to size at optimum sampling rates. 
Since enrolment information is available on all elements, a ratio estimate 
is used. 
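
A ratio estimate of this general kind may be sketched as follows. The sketch is a simplification that ignores the stratification and the sampling rates of the actual census design, and the function name and figures are hypothetical.

```python
def ratio_estimate_total(sample_y, sample_x, population_x_total):
    """Estimate the population total of y (e.g., school employment) from
    the sample ratio of y to a size measure x (e.g., enrolment) whose
    population total is known for all elements."""
    ratio = sum(sample_y) / sum(sample_x)
    return ratio * population_x_total

# Hypothetical figures: employees and enrolment in three sampled systems,
# with total enrolment known for every system in the population.
employees = [20, 41, 59]
enrolment = [200, 400, 600]
estimated_employees = ratio_estimate_total(employees, enrolment, 12000)
```

Here the sample ratio is 0.1 employee per pupil, so the estimated total is 1200. The gain over a simple expansion of the sample comes from the close relation between the characteristic and the size measure.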

The “national sample” of school systems numbers 1074 units of which 
418 are public institutions of higher education and 656 are local-school 
systems. This sample includes all state institutions of higher education. 
Subsampling of units from the state sample is used with optimum sampling 
ratios required for the national estimates at stated levels of reliability. 
Ratio estimates are used by relating reported data to a base period (the 
preceding October) in sampled units and applying the ratios to national 
estimates developed from the “state sample” for the preceding October. 

* The following information is based primarily upon information supplied by Morris 
H. Hansen of the U. S. Bureau of the Census. Parts of the immediately succeeding 
paragraphs are almost verbatim information supplied by him. 


Other Applications in Education 


This review does not attempt to include all the studies in education which 
used survey sample plans and which appeared during the three-year period. 
In preparing this review the usual sources of indexing have been 
consulted. All references to sampling in education in these sources were 
examined. Many references did not involve survey sampling methods and 
were excluded. There were undoubtedly many master’s and doctor’s theses 
and special studies in a number of areas that involve some serious attention 
to survey sampling theory that have been missed. It is doubted by this re- 
viewer that, even if a complete search of applications were made, it would 
be found that survey sample methods in education in general were being 
utilized as fully as the census studies outlined in the previous section. 

There are known to the writer surveys of opinion concerning schools 
and surveys of school populations in connection with surveys, the results 
of which have not been published in such form as to yield information on 
sample design. One published example which appeared during the period 
is the Pasadena Survey (23). Altho this is one of the most nearly com- 
plete survey reports, there is little information on the survey design of 
the opinion poll used other than the indication that there was stratification. 
The report suggests that it was not a probability sample. There were no 
statements of confidence levels and possible types of error. 

There were a few studies in education during the period which repre- 
sented advances in methods in this field. An example of a multi-stage, 
probability sample with careful planning is a survey of literacy in Japan 
(22). Thorndike and Hagen (46), under Air Force contract, undertook 
an exploratory study of the feasibility of a national aptitude census. Re- 
sults of their pilot studies suggest the types of response, the various biases, 
and estimated costs in such a survey. A Census Bureau survey, as a sup- 
plement, checked upon the use of prelisting made by the census in con- 
nection with the Current Population Survey. The statistical methods used 
in “preplanning” a comprehensive national program such as this should 
be of considerable interest to persons interested in such matters as the 
establishment of norms in the standardization of educational tests. 

Cocking (6) used a stratified-random design of urban schools in the 
United States in the study of diffusion of educational practices. He strati- 
fied within each of six regions of the United States on the basis of expen- 
diture-per-pupil unit and size of school system. The allocation of sample 
units among strata was made on the optimum efficiency plan, which 
required variable weightings and estimates. It is somewhat difficult to judge 
the adequacy of use of sampling methods by Cocking since his objective 
was not enumeration. He based his sampling allocation upon the expendi- 
ture measure, but the characteristics of the population he sought were 
percents of presence of certain educational practices. An interesting device 
he used was taking three separate samples in order to split three ways 
the burden of items to which he sought response. His report indicated no 
evidence of making full use of the scheme of three parallel samples in 
measuring the effectiveness of his design. In one respect, however, it was 
successful in that his response to mail questionnaire was unusually high as 
education surveys go. Each of his three samples amounted to approximately 
400 cases. This was considerably higher than the number required by his 
application of optimum allocation computations. It appears that he used 
confidence intervals for his percents which were based upon the unre- 
stricted-random variance formula altho the design was actually a stratified 
design. This is of common occurrence in educational literature. 
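
The consequence of applying the unrestricted-random formula to a stratified design can be illustrated with a short sketch. The figures are hypothetical and are not Cocking's; finite population corrections are ignored, and the simple divisor n is used in both formulas for comparability.

```python
def unrestricted_random_variance(p, n):
    """Conventional pq/n variance of a sample proportion."""
    return p * (1 - p) / n

def stratified_variance(strata):
    """strata: list of (population weight W_h, sample size n_h, proportion p_h).
    Variance of the weighted estimate: sum of W_h^2 * p_h * q_h / n_h."""
    return sum(w ** 2 * p * (1 - p) / n for w, n, p in strata)

def stratified_proportion(strata):
    """Weighted estimate of the population proportion."""
    return sum(w * p for w, _, p in strata)

# Hypothetical design: two equal strata that differ sharply in the
# proportion of schools showing a given practice.
design = [(0.5, 200, 0.2), (0.5, 200, 0.8)]
p_hat = stratified_proportion(design)
v_stratified = stratified_variance(design)
v_unrestricted = unrestricted_random_variance(p_hat, 400)
```

In this proportionate example the unrestricted-random formula gives 0.000625 against 0.0004 for the stratified formula, so the published interval would be wider than the design warrants. Under disproportionate allocation the error may run in either direction.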

Schunert (42) gave achievement tests to 100 schools from the popula- 
tion of Minnesota’s 522 public secondary schools. His design was stratified 
random. He used 42 categories of high schools based upon size and type of 
organization. He used proportionate allocation among strata and sub- 
sampled classes within schools. Strictly speaking, therefore, his was a 
multi-stage design involving cluster sampling. It turned out that the char- 
acteristics which he reported deal with teachers as sampling elements and 
methods of instruction using the classroom as the unit. Also, the individual 
pupil in some statistics became the unit in his report on achievement-test 
results in algebra and geometry. Like the Cocking study, Schunert’s use 
of stratified sampling appears to be primarily for purposes of achieving 
administrative convenience and insuring “representativeness.” The objec- 
tive was apparently analytical, not enumerative, since he does not indicate 
that he made a distinction in his design, or in his error formulas, between 
objectives of enumeration and objectives of analysis. 

Most educators are interested in comparisons among strata or in analy- 
tical survey objectives, as the previous two studies indicate. As Deming 
(14) has pointed out, sometimes the objectives of a survey may be both. 
He compared the two distinct types of sampling distributions involved 
in enumerative and analytical interpretations of samples and showed that 
the distinction is important in order to achieve economy in collecting 
and tabulating data and in interpreting results. Deming’s emphasis on this 
distinction is important to workers in education, who tend to use methods 
of selection primarily for purposes of assuring “representativeness,” such 
as stratification, without much concern for the elements of cost or the ele- 
ments of precision. 

A study by Weatherford and Stancil (57) is an example of sampling, the 
purpose of which was to compare certain characteristics of two clusters of 
individuals. The individuals were students in institutions of higher 
education. A checklist of activities in recreational participation was 
administered to 10 percent of each of four classes in two liberal arts institutions 
in North Carolina. Weatherford and Stancil referred to their study as 
stratified sampling. In effect, it was cluster sampling. There was no indi- 
cation of whether or not the proper variance formulas were used, altho 
critical ratios were published. The objective was clearly analytical and 
not enumerative. 

A study by Coffman (7) reported on the relationship of teacher morale 
to curriculum development. Coffman’s study is an example of a non- 
probability sample. The sample began with a group of school systems of 
convenience. The investigator took steps to “round-out” the sample so that 
it would be more “representative.” It is unfortunate that limited resources 
for research very often force educators into this position. 

A very interesting, special type of sampling problem which may be 
classed as educational research was reported by Goodman (17). The prob- 
lem was estimating from samples of lists of acquisitions of books from 
various libraries the number of common items, i.e., overlap in acquisitions. 
The sort of statistical literature of interest in administrative decisions con- 
cerned with anticipating the expanding needs for school facilities is il- 
lustrated by a paper by Schmitt (41) on short-cut methods of estimating 
county population. Wittenborn (58) criticized the application of sample 
statistics in psychology, the use of inadequate sample designs, and the mis- 
application of sample statistics. In speaking of the necessity of avoiding 
“bias,” Deane (12) wrote as if he meant by that, “getting as representative 
a group of measurements as possible.” He indicated that characteristics 
being measured must be distributed as nearly as possible in the same 
proportions among persons in the sample as in the population from which 
it is drawn. 

There appear to be many misconceptions concerning sampling which 
are prevalent in education. The purpose of randomization is not to get a 
better (or more “representative”) sample, but to enable the sampler to 
use known distributions in specifying his precision. The definition of bias 
as difference between the expected value of a sample estimate and the 
parameter is important. The idea is apparently still prevalent that samples 
should be “purposive,” meaning that the best way to sample is to “hand 
pick” elements from the population in such a manner that “representative- 
ness” will be assured. A characteristic of samples which produce measures 
of precision is that the selection procedures within the limitations of the 
design are automatic and random. 
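
The definition of bias given above can be made concrete by enumerating every possible sample from a very small population. The population, the "hand-picked" rule, and the function names below are hypothetical and serve only to illustrate the definition.

```python
from itertools import combinations

def expected_sample_mean(population, n):
    """Expected value of the sample mean over all equally likely
    samples of size n drawn without replacement."""
    samples = list(combinations(population, n))
    return sum(sum(s) / n for s in samples) / len(samples)

def bias(expected_estimate, parameter):
    """Bias: expected value of the estimate minus the parameter."""
    return expected_estimate - parameter

population = [1, 2, 3, 10]
parameter = sum(population) / len(population)  # population mean, 4.0

# Random selection: every pair of units is equally likely.
random_bias = bias(expected_sample_mean(population, 2), parameter)

# Purposive selection: the sampler always hand picks the two units
# judged "typical", so the estimate never varies.
purposive_bias = bias((2 + 3) / 2, parameter)
```

The random procedure has zero bias even tho individual samples vary widely, and that variation is exactly what known distributions describe. The hand-picked sample is biased by -1.5 in this example and offers no basis for a statement of precision.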

It is often not too difficult to get good response on factual information 
from schools and school systems, as the Cocking study suggests. This 
means that there should be no trouble in many educational surveys in 
taking advantage of various probability sample designs. In this connec- 
tion, it is to be observed that improved understanding of the importance 
of the sample design problem is revealing itself in educational literature. 
For instance, Remmers (37) has prepared a text on opinion and attitude 
measurement which includes a chapter on sampling. He no doubt aimed 
his treatment at the level of statistical training properly to be expected 
of his students, since his treatment of statistics is not sufficiently advanced 
to permit students to make full utilization of sampling theory in economi- 
cally selecting among alternative designs and making accurate statements 
of the precision of surveys. Effective sample design is not open to students 
whose knowledge of statistics is limited to the level of the conventional 
pq/n formulas for simple random sampling from an infinite population. 
If, as suggested earlier, sufficient advances are made in revision of courses 
in elementary educational statistics on the order of Walker and Lev (56), 
in another decade or so a Remmers type of publication might justifiably 
include sampling statistics at the level of Kish (29). 
CHAPTER II 


Scaling 


FREDERIC M. LORD* 


“Scaling” appears to be virtually indistinguishable in meaning from 
“measurement,” which may be defined as the assignment of numerals 
to objects (including people), according to some rule, in order to repre- 
sent their properties. If a method of scaling rests upon a verifiable hy- 
pothesis as to the existence of some specified self-consistency in the data, 
then the data may or may not be found to scale. To this extent such a 
method of scaling is not arbitrary and thus becomes of theoretical as well 
as practical interest. In order to keep the present review within bounds, 
many scaling methods that are highly arbitrary will be largely omitted 
from consideration. Such methods include the use of arbitrary index 
numbers or weighted composites, the simple counting of “correct” 
responses, and Likert’s method of summated ratings. Such omission should not 
necessarily be construed as disparaging the method in question. Discus- 
sion relating to the arbitrary choice of an origin and unit of measurement 
will also be omitted. 

Because of the recency of important developments in the field of scaling, 
the only fairly comprehensive survey of the area available is briefly con- 
tained in a chapter by Green (60), written in the context of attitude meas- 
urement. Goode and Hatt (59) and Remmers (122) have devoted three 
chapters and one chapter, respectively, to readable surveys of much of the 
field as it existed before publication of Volume IV of Studies in Social 
Psychology in World War II (139). The reader wishing a thoro founda- 
tion in the subject will wish to study the text by Guilford (67). Goheen 
and Kavruck’s bibliography (58) will also be of help. 


Theory of Measurement 


Theory of measurement, at least with respect to mental measurement, is 
still highly controversial. Stevens (137) summarized his own well-known 
theory of measurement and ably discussed the problems and methods of 
psychophysics and the nature of mathematics and of mathematical models. 
No sharp disagreement with this theory was voiced by Guilford and 
Comrey (68) or by Lorge (104). 


* The reviewer is indebted to Dr. Harold Gulliksen for many helpful suggestions 
and comments. He is indebted to Dr. Warren Torgerson for the opportunity to read 
and discuss drafts of many of the chapters in Torgerson’s forthcoming monograph 
on theory and methods of scaling, now being prepared under the sponsorship of the 
Social Science Research Council. The reviewer has benefited from Dr. Torgerson’s 
helpful comments on a draft of the present manuscript. 
Coombs, Raiffa, and Thrall (33) gave a discussion of mathematical 
models and described a variety of types of scales, many of them new. 
Weitzenhoffer (160), after discussing various types of scales, asserted 
that it is utterly impermissible to plot on graph paper or otherwise to 
investigate the relation of psychological magnitudes, such as judged weight, 
to physical magnitudes, such as actual weight, since the former magnitudes 
do not possess a “multiplicative” scale. The reviewer disagrees. 

Comrey (27) asserted that, if a “definite set of operations has been 
employed to lend meaning to the unit established,” this is sufficient to 
justify the calculation of means, even tho the measurements do not possess 
an interval scale. He subsequently (26) urged that in the absence of an 
interval scale, computation of means can be justified by showing that this 
does not lead to serious practical error. He suggested that most workers 
should not try for fundamental measurement, since this can be attained 
only by observing physiological manifestations of ability. 

Burke (20) asserted that the properties of the scale of measurement 
should have no effect on the choice of statistical technics used to represent 
and interpret the measurements. Senders (133) correctly pointed out that 
conclusions reached in this way relate to the numbers but not necessarily 
to the objects measured. Lord (100, 101) showed that means and standard 
deviations may in certain circumstances be computed and rigorously used 
for testing a null hypothesis even tho the scale of measurement has no 
interval, or even ordinal, properties. To disagree with this conclusion 
is to assert that no generalizations whatever can be formulated regarding 
the magnitude of the sum of a random sample of numbers drawn from a 
table of random numbers or from a roulette wheel; yet Bennett (13), 
and apparently also Behan and Behan (8), disagreed. 

Hempel (80), in the most basic article since that by Bergmann and 
Spence, recently reprinted (16), asserted that “fundamental” measure- 
ment can exist in the absence of any operational definition of addition, 
as evidenced by the mel scale of pitch and other scales derived by frac- 
tionation. Even mass is not additive in relativistic physics. Hempel quotes 
Schlick as pointing out that the pulse beat of the Dalai Lama could prop- 
erly be considered as defining equal units of time, altho such a definition 
would complicate the expression of natural laws. 

The reviewer believes that if the units of measurement used for a certain 
mental trait are not accepted as equal along the scale, this is because no 
scale has been devised that will lead to a really simple formulation of 
natural laws involving this trait. If and when clear simplifications of 
scientific theory are possible, then equal units, additive scales, or multi- 
plicative scales will promptly become accepted and used. In this sense, then, 
the measurement scales of the social scientist possess all the properties 
for which he presently has any clear use. If we are asked to provide a 
scale with “equal” units in some area, it will be enlightening to insist upon 
an operational definition of the word “equal” as used in the request. 
Coombs’s Theories and Methods of Scaling 


Coombs wishes to see whether or not his data exhibit strict regularities 
that can be observed without resort to the classical theories of errors 
of measurement (70: Chapters 1-5; 158: Chapter 12) to explain away 
small irregularities. The fundamental ideas underlying Coombs’s philosophy 
of scaling were outlined in (28). Coombs’s monograph (31) has been 
reviewed by Edwards (38) and by Green (65). Much of the material 
in the monograph, including his theory of data, was also treated in a 
later publication (30). Only a few special points can be mentioned. Coombs 
discussed various types of scales, including the ordered metric, in which 
something is known about the relative magnitudes of the distances between 
objects. Coombs’s method of similarities, a modification of his unfolding 
technic (29), was illustratively used to scale private (A), corporal (B), 
buck sergeant (C), master sergeant (D) on a continuum of authority. 
O (the subject or judge) considered these three at a time, each time stating 
which pair seemed most and least alike. For each of two O’s, Coombs found 
unidimensionality, the order being ABCD. The distances were partially 
ordered with AC > CD > BC and BD > AB > BC. 

Bennett (14) extended unfolding technic to n dimensions. Bennett (15), 
Coombs and Kao (32), and Milholland (111) theoretically extended 
Guttman’s perfect scale to n dimensions. 

Coombs (30) dismissed conventional mental test theory in one or two 
paragraphs as actuarial—of great practical but of little theoretical value. 
The mental test theorist will probably feel that Coombs’s scales will in 
practice not be found to exist, unless Coombs agrees to allow random 
fluctuations about the rigorous theoretical model. It is, nevertheless, likely 
that Coombs’s varied theories will prove to be fruitful in many respects. 


The Measurement of Utility 


An important field of scaling has been brought into focus by Edwards’ 
excellent review (40) of 209 titles, mostly in the field of economics, relating 
to the scaling of utility. Mosteller and Nogee (116) have applied standard 
psychometric methods to scaling the utility of money. The most important 
work in the field is now being done by Coombs, Edwards stated. The reader 
is referred to Edwards’ article for further details. 


Psychophysical Scaling 


The fractionation technic has recently been used to set up interval scales 
for sensed sweetness (105), pain (144), and passage of time (66, 128). 
In the first of these studies, some of the underlying assumptions of the 
method were checked experimentally—something that should be done 
much more frequently. Armington (6) claimed that Harper and Stevens’ 
standard formula for scaling by the fractionation technic is inadequate; 
he presented a more general formula. 
Davis (36) asserted that the available standard errors for the traditional 
scaling methods involving the psychometric curve are incorrect. Ignoring 
the fact that these methods are justified by the principle of maximum likeli- 
hood, as authoritatively discussed by Finney (46), Davis suggested a new 
and apparently inferior method. Harrison and Harrison (78) modified 
the traditional method by introducing a correction for guessing; Tiffin 
and Rabideau (150) found that results obtained by the modified and 
unmodified methods correlated .995. Perry (121) showed that biserial 
correlation methods can be applied for computing limens. Many articles 
of less general implications for scaling are omitted here. 

Volkmann (157) outlined a philosophy of education based on the fact 
that psychophysical judgments vary with the anchoring stimuli available. 


Paired Comparisons 


Punched-card procedures for Thurstone’s Case V (which assumes equal 
standard deviations) have been made available (89, 119). Simpson (135) 
tabulated various experimental designs for use when all judges cannot rate 
all stimuli (S). In one of these designs, every S is compared with every 
other S once and only once. 

Suci, Vallance, and Glickman (142) found that the paired-comparisons 
method showed considerably higher split-half reliability than did five other 
methods used to rate officer-candidates on “promise as future officers.” 
Witryol (161) reported correlations of essentially unity between scale 
values found by Case III, by Case V, and by Guilford’s shortcut method. 
Newhall (117) reported correlations above .90 between the scale values 
obtained using the same 12 judges for each of the two methods, and above 
.77 when 12 different judges use each method. Mosteller (112) showed 
that Thurstone’s Case V method is a least-squares solution, and furthermore 
does not depend, as previously thought, on the assumption that the dis- 
criminal processes are uncorrelated, but merely assumes their inter- 
correlations to be equal. The least-squares method still does not take into 
account the large standard errors of scale values derived from extreme 
proportions. Mosteller (113) investigated the results of applying Case V 
when the standard deviations are not equal. He (114) also provided 
a large-sample significance test for Case V—a development of great 
importance. 
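
For readers unfamiliar with the computation, the usual Case V solution for a complete matrix of proportions may be sketched as follows. The sketch assumes that no observed proportion is exactly 0 or 1, the function name is hypothetical, and the origin is placed arbitrarily at the mean of the scale values.

```python
from statistics import NormalDist

def case_v_scale_values(p):
    """p[i][j]: proportion of judgments in which stimulus i is judged
    greater than stimulus j. Returns one scale value per stimulus."""
    std_normal = NormalDist()
    m = len(p)
    scale = []
    for i in range(m):
        # Convert each proportion to a unit normal deviate; a stimulus
        # compared with itself contributes a deviate of zero.
        z_row = [0.0 if i == j else std_normal.inv_cdf(p[i][j])
                 for j in range(m)]
        # With complete data the least-squares scale value is the row mean.
        scale.append(sum(z_row) / m)
    return scale
```

If the proportions are generated without error from scale values 0, 0.5, and 1.0, the function recovers them up to the arbitrary origin, returning values close to -0.5, 0, and 0.5. Extreme proportions yield large deviates with large standard errors, which is the weakness noted in the text.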

Gibson (56) outlined the standard least-squares solution for the simul- 
taneous linear equations that constitute Thurstone’s Case IV. Burros (21) 
and Burros and Gibson (22) presented two new approximative methods 
for dealing with Case III. In case the second method yields a solution 
with negative variances, their rationale calls for adding one to each negative 
variance and dividing the sum by two so as to obtain a positive variance! 

In Scheffé’s analysis of variance for paired comparisons (130), the 
judges rate the stimuli on a 7-category verbal scale, the categories being 
assigned arbitrary numerical values for purposes of the analysis. Bradley 
and Terry (18) derived maximum-likelihood estimates and significance 
tests for paired-comparisons data on the seemingly arbitrary assumption 
that if the scale value of stimulus i is $x_i > 0$ with $\sum x_i = 1$, then the 
probability that $x_i$ will be preferred to $x_j$ is given by $x_i/(x_i + x_j)$ 
for all i and j. 
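
The Bradley and Terry assumption lends itself to a short numerical sketch. 
The fixed-point iteration below is a standard way of maximizing the likeli- 
hood under that assumption; it is offered as an illustration and is not 
claimed to be the computing procedure of the original paper. The function 
name and the layout of the table of wins are hypothetical. 

```python
def bradley_terry(wins, iterations=200):
    """Estimate scale values x_i > 0 with sum(x_i) = 1.

    wins[i][j] is the number of times stimulus i was preferred to
    stimulus j (hypothetical layout). Every stimulus is assumed to
    have been compared with others and to have at least one win and
    one loss; otherwise the estimates drift toward 0 or 1.
    """
    n = len(wins)
    x = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each pair contributes (number of comparisons) / (x_i + x_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (x[i] + x[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        total = sum(updated)
        x = [v / total for v in updated]  # rescale so the values sum to one
    return x
```

For two stimuli, one of which is preferred in three of four comparisons, 
the estimates are .75 and .25, as the assumed probability requires. 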


Equal-Appearing and Successive Intervals 


Thurstone-type attitude scales and the methods of equal-appearing and 
of successive intervals, used especially in the selection of items for these 
scales, were treated in texts by Ferguson (45) and by Jordan (84). 
Webb (159) reported a successful empirical check of the assumption 
of generality for a Remmers-type generalized attitude scale. 

Fehrer (44) showed that the scale values found for a Thurstone attitude 
scale for certain of the statements varied with the range of attitudes repre- 
sented in the set of statements submitted to the judges for scaling. Similarly, 
Hovland and Sherif (82, 134) reported that the judges’ opinions affected 
the scale values found for attitude statements. These conclusions do not 
seem to be changed greatly by Gibson’s comments (54). Kelley and others 
reported, according to Green (60), that Hovland and Sherif’s discrepancies 
largely vanish if the method of paired comparisons is used on the data 
instead of the method of equal-appearing intervals. 

Edwards (37) reported a high linear relation between scale values 
obtained by the method of successive intervals and by the method of paired 
comparisons. Edwards and Thurstone (39) reported that the assumption 
of equal discriminal dispersions did not greatly change results obtained 
by this method. Attneave (7) and Garner and Hake (53) each suggested 
modified methods for dealing with successive-intervals data, assuming 
discriminal dispersions to be equal. Garner (52) applied his method to 
data on loudness judgments and found that the resulting scale agreed 
with the usual scale of cumulative difference limens. He also observed 
a phenomenon quite parallel to Fehrer’s. 

Rimoldi and Hormaeche (127) and Torgerson (152) derived formulas 
generalizing the method of successive intervals by allowing the category 
boundaries to have discriminal dispersions, as well as the stimuli judged. 
Gulliksen (69) derived a least-squares solution for successive intervals. 
The quantity minimized is not the one that would have been chosen by the 
reviewer; but the result is formally equivalent, Gulliksen pointed out, 
to Horst’s method for dealing with a matrix of incomplete data. 


The Method of Triads 


The usual Thurstone scaling technics can be applied to scaling the 
“distance” or degree of similarity between pairs of objects. Attneave (7) 
raised the very important question whether such psychological distances are 
additive (Euclidean). He reported five experiments apparently showing 
that they are not. More work should be done on this basic problem. 
Torgerson (153, 154) developed practical methods for determining the 
troublesome unknown additive constant in the method of triads. He scaled 
colors, obtaining results in agreement with the Munsell system and 
representable in Euclidean space. Messick and Abelson (109) derived 
a valuable improved method for determining the additive constant. 

The work mentioned throws open an important area in which rapid 
developments may now be expected. 


Recent Developments by Thurstone 


Thurstone’s law of comparative judgment as originally stated assumes 
that the discriminal processes of a single O in repeatedly judging a single S 
are normally distributed. The practical worker using the method of paired 
comparisons or the method of successive intervals almost always proceeds 
as if normality would still be expected when repeated judgments of one O 
are replaced by the single judgments of many O’s. This may be reasonable 
for certain types of S, but is obviously nonsense for certain others. Further- 
more, the necessary assumptions about the intercorrelations of the discrimi- 
nal processes cannot be expected necessarily to hold when different O’s 
are used. Thurstone (149) discussed these problems and outlined a number 
of new methods for dealing with both types of difficulties. Important 
advances may be expected to result. 

In the same paper, Thurstone briefly reported his “birthday-gift study,” 
showing that the scale value of A plus the scale value of B is equal to the 
scale value of A + B, as determined by the method of paired comparisons. 
The result is of great importance theoretically, since wherever it is con- 
firmed, it clearly gives operational meaning and justification to the processes 
of adding scale values and of computing means. 

Applications of various scaling methods to studies of food preferences 
were reported by Thurstone (146, 147). He (148) gave a nontechnical 
survey of his varied work and ideas in scaling. 


Ranks 


Cureton (35) pointed out that averages of ratings or rankings obtained 
from a few judges are more variable than those obtained from many 
judges and indicated a method to obtain comparability. 

Anderson, Gray, and Kullstedt (2) presented tables for transforming 
ranks into normal deviates cutting off an area of the normal curve equal 
to 100 (Rank − ½)/N, where N is the number of objects ranked. For 
many purposes, Fisher and Yates’s table (47: Table 20) will be more 
appropriate, giving the expected value of the r-th observation in order 
of magnitude when a sample of size N is drawn from a normal population. 
Ehrenberg (41), however, raised serious questions regarding Fisher and 
Yates’s suggestion that ranks converted to expected values can properly 
be used in an ordinary analysis of variance. 
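
The transformation tabled by Anderson, Gray, and Kullstedt can be sketched 
in a few lines. The sketch assumes that rank 1 denotes the lowest object 
and that the area is measured from the lower tail; both conventions, and 
the function name, are assumptions made for illustration. It does not 
reproduce the expected values of order statistics given by Fisher and Yates. 

```python
from statistics import NormalDist


def rank_to_normal_deviate(rank, n):
    """Normal deviate cutting off the proportion (rank - 1/2) / n.

    rank runs from 1 (lowest, by assumption) to n, the number of
    objects ranked. The proportion is 1/100 of the percentage
    100 (rank - 1/2) / n described in the text.
    """
    area = (rank - 0.5) / n
    return NormalDist().inv_cdf(area)
```

For five objects, the middle rank maps to a deviate of zero, and ranks 1 
and 5 map to deviates equal in size and opposite in sign. 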


Ratings 


Bendig (9, 10, 11) and Bendig and Hughes (12) reported little relation 
of number of rating categories to reliability of ratings, but a considerable 
relation to “transmitted information,” as defined by Garner and Hake (53); 
reliability and transmitted information were both increased by verbal 
anchoring of the categories. 

Helson, Michels, and Sturgeon (79) advanced new evidence to show 
that a usual type of rating scale with nine verbal categories can be used 
in psychophysical studies if the categories are scored 1 to 9. The data 
verified their theoretical prediction, based on their reformulation of 
Fechner’s law, that the resulting scores will be a linear function of the 
logarithm of the physical magnitude of S. Michels and Helson (110) 
reported on the time-order effect, further supporting this reformulation. 


Sociometry 


Pepinsky, Siegel, and Vanatta (120) used Thurstone scaling methods 
to build a “Group-Participation Scale” composed of “guess-who” items. 
Witryol and Thomson (162) obtained social-status scores by Moreno’s 
“partial-rank-order” method and also by Thurstone’s paired-comparisons 
method. Retests after one and five weeks in four groups yielded median 
reliabilities of .75 and .96 for the two methods, respectively. Katz (86) 
and Seeley (132) devised matrix methods for measuring influence or status, 
taking into account who chooses whom. 


Guttman Scales 


Most of the foregoing material, except for some of the work of Coombs, 
has been concerned with situations in which only the stimuli (or individuals 
treated as stimuli) are to be scaled. In most of the remaining discussion, 
both stimuli and judges (or test items and examinees) are to be scaled 
simultaneously, thus avoiding certain assumptions made previously. 

Volume IV of Studies in Social Psychology in World War II (139), 
devoted primarily to Guttman’s work, has been widely reviewed (23, 42, 
106, 108, 143), special articles being devoted to it in many cases. Madow 
(106), in a generally very favorable review, pointed out the important fact 
that the identification of Guttman’s second principal component as intensity 
was based only on the fact that both have a U-shaped regression on attitude; 
other psychological functions might also have such a regression. McCarthy 
(108) raised the question as to whether enough data will be found that 
sufficiently approximates Guttman’s perfect scale for the model to be 
of wide practical value. Eysenck (42) raised the same question and also 
criticized certain of Guttman’s claims; Guttman (73) wrote a corrosive reply. 

Stouffer (138) presented a slightly modified version of his very helpful 
Chapter 1 (in 139), discussing Guttman’s and Lazarsfeld’s work. Green 
(61), discussing Guttman’s work, suggested that the similarity of the 
second principal component to intensity is quite accidental. Certain other 
difficulties have been discussed (74, 131, 136). Very recently, Guttman gave 
a new, nonmathematical discussion of his theories (76). 

Various technics for reducing the labor of scalogram analysis were 
presented (57, 85, 107, 126). Green (64) reported a relatively simple 
method that uses only summary statistics. Niven (118) compared the 
reciprocal averages technic with the Cornell scale analysis technic, using 
them as methods of item selection; he found that they yielded approximately 
equally reliable final test forms. 

Jahn (83) discussed a rather implausible universe of items in which 
errors of measurement are equally likely for all O and for all S. Stouffer 
and others (140, reprinted as 141) developed H-technic, in which 
“contrived” items are used to obtain high reproducibility. A contrived 
item is composed of several actual items treated as a unit, certain patterns 
of responses being defined as constituting a “positive” response to the 
contrived item. Mosteller (115) modified Guttman’s scaling theory so as 
to apply it to noncumulative items such as are used in Thurstone-type 
attitude scales. 

Riley and others (126) published a book describing their use of Guttman 
scaling procedures to scale groups and the objects of group action. Suppose 
that in a certain group of people it is observed that pairs of individuals 
may talk about movies, or about movies and petting, but never about 
petting only. Here, movies and petting are scaled on the degree of intimacy 
which they represent as subjects for conversation. Some earlier articles 
(124, 125) were republished here with revision; one (123), dealing with 
similar questions, was not included. In Part 4 of their book, Riley and 
others (126) gave a manual of scaling procedure, showing work sheets 
and outlining IBM procedures. In Chapter 18, Toby and others described 
and compared the results of applying H-technic and the Israel Gamma 
technic of “image analysis.” 

A person’s “image score” on an item is the “best” prediction of his 
score that could be obtained knowing his responses to all other items 
(72, 151). In the situation discussed, the image score of individual X 
on item S is the score, 0 or 1, obtained on item S by the majority of the 
individuals who have the same pattern of response as X on all items except 
item S. The Israel Gamma technic, according to Toby, is simply to scale 
the image scores rather than the actual scores. This procedure tends 
to exclude specific item variance and thus to raise reproducibility. 
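
The definition of the image score just given can be sketched for dichoto- 
mous items as follows. The function name and the data layout are hypo- 
thetical, and the rule adopted for tied majorities (score 0) is an 
assumption that the sources cited do not settle. 

```python
from collections import Counter


def image_scores(responses):
    """Image scores for 0/1 item responses.

    responses is a list of tuples, one tuple of 0/1 scores per person.
    The image score of a person on item s is the score obtained on
    item s by the majority of persons (the person included) who show
    the same pattern on all other items. Ties go to 0 by assumption.
    """
    n_items = len(responses[0])
    images = []
    for person in responses:
        row = []
        for s in range(n_items):
            rest = person[:s] + person[s + 1:]
            votes = Counter(
                other[s]
                for other in responses
                if other[:s] + other[s + 1:] == rest
            )
            row.append(1 if votes[1] > votes[0] else 0)
        images.append(tuple(row))
    return images
```

In a small artificial example containing one nonscale pattern (0, 1, 0) 
among perfect-scale types, the nonscale pattern receives the image 
(1, 0, 0), which is a perfect-scale type. 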

Guttman (72) discussed Israel Alpha technic, which the reviewer cannot 
clearly distinguish from the Gamma technic described by Toby. Guttman 
defined a universe of items as forming a quasi-scale if and only if the image 
scores of the items form a perfect scale. 

Henry (81) tried out various methods for the assignment of nonscale 
individuals to perfect-scale types, using a modification of Lazarsfeld’s 
latent-distance technic as a criterion by which to judge the adequacy of 
each method. Borgatta and Hays (17) discussed Henry’s paper and offered 
a new method. Guttman (72) offered image scores to effect an assignment. 

Guttman and Foa (77) discussed an application of scale and intensity 
analysis to certain empirical data. Guttman (71) presented a new metric 
which has certain additive properties whose implications are not yet clear. 

Guttman (76) recently gave a nonmathematical discussion of his scaling 
theory. Only a few special points can be mentioned here. The third and 
fourth principal components were identified as “closure” and “involution,” 
respectively. “Closure” is the extent to which a person’s mind is made up; 
“involution” is the extent to which a person is actively turning an attitude 
over in his mind. It is significant that second and higher components should 
in practice be estimated by the use of extra attitude questions (e.g., ques- 
tions about intensity of attitude), according to Guttman, rather than by 
actually computing the higher principal components of the originals. 

Guttman (75) discussed several methodological fronts. Image analysis, 
he stated, makes the coefficient of reproducibility no longer necessary; it is 
a far simpler, but deeper, approach than Lazarsfeld’s latent structure 
analysis; if a true latent structure exists, the Israel Alpha technic will 
discover it with hardly any work. Nodes are similar to latent classes; 
nodular theory is a special case of image theory, but it includes latent 
structure theory and ordinary factor analysis as special cases. 


Latent Structure Analysis 


The reviews of Volume IV of Studies in Social Psychology in 
World War II generally had little to say about Lazarsfeld’s work, largely 
because of the introductory nature of his chapter. A brief but basic survey 
of his work was given by Green (60); a brief nontechnical description, 
together with a practical example, was given by Rossi (129). Green (63) 
discussed the relation of latent structure analysis (LSA) to factor analysis. 

A very helpful nonmathematical discussion of the logic underlying his 
model was given by Lazarsfeld himself (92) in his recent book. Lazarsfeld 
pointed out that LSA provides a reasonable mathematical model for 
measurement whenever we have several “items” or tasks or observations 
on each O. Thurstone and Chave, in writing about their attitude scales, 
suggested that scale values might somehow be assigned to the items on the 
basis of the examinees’ responses rather than by the use of judges; 
Lazarsfeld pointed out that this is exactly what his scaling methods do. 
Various methods for estimating the parameters of the latent class model 
were given and illustrated by Lazarsfeld (93: Papers 4, 5, and 6) and by 
Lazarsfeld and Rossi (94, 95). Gibson (55) showed that the latent class 
problem can be solved by factoring a certain matrix of observed data and 
rotating the resulting factor matrix to meet certain conditions; much of the 
theory and the experience of factor analysts can be directly and successfully 
applied. Green (61, 62) derived what is probably the most efficient method 
available. More recently, Anderson (3) presented a method having the 
great virtue that the estimates obtained for the parameters are known to 
be jointly normally distributed in large samples, permitting the construction 
of confidence regions. Two continuous models were dealt with in detail: 
(a) the latent-distance model, which includes Guttman’s perfect scale 
as a special case (93: Paper 3); (b) the model assuming rectilinear trace 
lines—an assumption leading to tremendous simplifications (93: Paper 2). 
Work is continuing on special cases of the trace line y = a + B(x — c)‘. 

Lazarsfeld and Wiggins (96) applied LSA to problems of reliability; 
Anderson (4, 5) applied LSA and the theory of Markov chains to time 
changes in attitudes. The assumptions required for the three foregoing ref- 
erences are more extensive and are even more difficult to accept than the 
assumptions required for the simpler models such as the latent dichotomy. 
Lazarsfeld suggested that the latent dichotomy model will hold for data 
obtained by lie-detector tests given to a group of thieves and law-abiding 
citizens. Altho a fairly clear distinction can perhaps be made between 
thieves and nonthieves, it is not clear to the reviewer that these two groups 
are likely to be effectively homogeneous internally, as required by the 
latent class model. It is very difficult to believe the statement by Lazarsfeld 
and Rossi (95: 133) that the latent dichotomy model is particularly ap- 
propriate wherever “administrative or managerial considerations require 
the notion of an underlying dichotomization of persons.” These authors 
propose use of the model in federal loyalty investigations in order to classify 
employees as “loyal” or “subversive”! 

If enough latent classes are permitted, the latent class model may approxi- 
mate the continuous latent structure model. These models are clearly of 
fundamental importance in the field of measurement. The reviewer pre- 
dicts that they will come into use regardless of any computational difficulties. 


Scaling Aptitude and Achievement Tests 


Models similar to the continuous latent structure model have been used 
in the area of cognitive tests for some years (90, 91). A very general model 
that does not specify the shape of the item trace lines was discussed non- 
mathematically by Lord (102), who listed 14 implications regarding the 
relations between obtained scores, true scores, and the latent trait measured. 
Lord (103) drew various implications from a model in which the item 
trace lines are normal ogives and reported an empirical study showing 
that the model gave a good fit when used to predict from the actual item 
difficulties and intercorrelations the empirically observed univariate and 
bivariate score distributions on mathematics tests. A good nonmathematical 
summary of this theory and the conclusions reached was given by Brogden 
(19). Cronbach and Warrington (34) and Lord (98, 103) independently 
applied this model to the problem of maximizing the discriminating power 
of the test. 
Working with the normal-ogive model, Lord (98) derived the maximum 
likelihood equations for simultaneously estimating from the examinees’ 
responses to the items both the “ability score” of each examinee and the 
values of two parameters for each item. Tucker (155) obtained an approxi- 
mate least-squares solution for the same estimation problem and demon- 
strated its feasibility by an actual practical application. Tucker’s formulas, 
by a mathematical coincidence, also constitute a solution to the problem of 
scaling data obtained by the method of successive intervals. The method 
takes into account the large sampling errors of the scale values derived 
from extreme proportions. 

Carroll (25) proposed a valuable definition of test “homogeneity” (im- 
plicit in all the previous studies in this section): a test is homogeneous 
whenever the probability of a correct answer to item i by examinee j can 
be approximately expressed as a function of the difficulty of item i and the 
ability of examinee j. 
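
A minimal sketch may make the notion of an item trace line concrete. It 
assumes a normal-ogive trace line with two parameters for each item, 
written here as a difficulty and a discrimination; this parameterization 
and all names are assumptions made for illustration and are not taken 
from the papers cited. 

```python
from statistics import NormalDist


def trace_line(ability, difficulty, discrimination=1.0):
    """Probability of a correct answer under a normal-ogive trace line.

    The probability depends only on the examinee's ability and on the
    item's two parameters (assumed form: a normal ogive in
    discrimination * (ability - difficulty)).
    """
    return NormalDist().cdf(discrimination * (ability - difficulty))


def expected_score(ability, items):
    """Expected number right for a list of (difficulty, discrimination)."""
    return sum(trace_line(ability, b, a) for b, a in items)
```

Under this form an examinee whose ability equals the difficulty of an 
item has a probability of .50 of answering it correctly, whatever the 
discrimination of the item. 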

Tucker (156) discussed fundamental theoretical aspects of scaling 
by methods not based on normative data, with especial reference to apti- 
tude tests. Flanagan (48) and Lindquist (97), appearing on the same 
panel, discussed practical aspects of scaling, mainly with reference to 
achievement tests, for which they desired normative methods of scaling. 
Gardner (50, 51), appearing on the same panel, discussed his K scores 
which are used for the new Stanford Achievement Test (88). These scores 
were derived on the assumption that each of several groups tested will 
have Pearson Type-III score distributions when K-scores are used. Keats 
(87) reported that virtually all raw-score distributions are Pearson 
Type-I distributions. 

Burt (24) critically discussed Thurstone’s scaling methods as applied 
to the Binet, and described some of his own related work on the problem 
of assigning difficulty values to items and to tests. Fan (43) pointed out 
the theoretical fallacy in certain related applications of Thurstone’s scaling 
methods to the problem of equating different forms of the same test by 
means of item statistics; he also gave empirical examples. Swineford and 
Fan (145) proposed a new method for dealing with this problem and 
presented empirical results showing the superiority of their method. A 
maximum-likelihood method of equating, not involving item statistics, was 
derived by Lord (99). 

The recently published Technical Recommendations for Psychological 
Tests and Diagnostic Techniques (1) recommended that normalized stand- 
ard scores be used in preference to other derived scores. Gulliksen (70) 
and Flanagan (49) each published valuable chapters dealing with scaling, 
equating and norms. In his chapter Flanagan listed five main types of infor- 
mation needed about test scores: (a) task performed by the examinee, (b) 
relative standing in a specified group, (c) level of educational development 
at which this performance is typical, (d) growth during a fixed period, 
and (e) pattern of scores in various areas. The system of scaled scores that 
he devised for the Cooperative Test Service was outlined there in some 
detail. 


CHAPTER III 


Regression and Correlation 


CYRIL J. HOYT and MURRAY C. JOHNSON 


Since regression and correlation have not been treated as separate topics 
in earlier issues of the Review, it seemed desirable to cover more than 
the usual three-year period. Papers published in the last decade are in- 
cluded if they have not been reviewed under another topic yet appear to 
make significant contribution to current development in regression and 
correlation. Articles from journals in statistics, biometrics, and econo- 
metrics are cited whenever the authors consider the topics likely to be 
useful in educational research. 

Because regression theory permeates much of statistics, including cor- 
relation, analysis of variance, and experimental design, a substantial 
amount of theoretical work on regression has been done in recent years. 
Applied papers in education observed by the authors do not for the most 
part reflect these developments, some of which were originally intended 
for use in different fields. In other cases, an appropriate problem or a 
skilful translator may yet be needed. 


Regression Theory 


Papers on regression cover a wide range of topics reflecting the di- 
versification of problems in different areas of theoretical and applied 
statistics. A recent book by Anderson and Bancroft (3) gave the general 
regression model for r fixed variates and presented related computational 
procedures. Relationship of regression to experimental design and analysis 
of variance was treated in detail. Rao’s book (63) also included a useful 
discussion of the theory of linear estimation and tests of linear hypotheses. 

Durbin (22) considered best estimates of beta coefficients when there 
exists extraneous information about one of the coefficients. Daniel and 
Heerema (16) discussed problems of slope estimation and of linear ex- 
trapolation when the precision of y measurements changes with different 
values of x. 

The problem of estimating linear restrictions on regression coefficients 
for multivariate normal populations was treated by Anderson (4). The 
meaning and uniqueness properties of a method of representing a multi- 
variate distribution by a single straight line, a “line of organic correlation,” 
were discussed by Kruskal (48). 

A technic for estimating and comparing residual regression when there 
are two or more related sets of observations was developed by Carter (10). 
Tintner (70) developed a test for the existence of linear relationship be- 
tween weighted regression coefficients. Durand (21) proposed the use of 
a joint confidence region as a measure of accuracy for several regression 
coefficients. 

Chand (11) considered the relative merits of different statistics avail- 
able for testing differences between two means or two regression coefficients 
in relation to one-sided (asymmetric) and two-sided (symmetric) alterna- 
tives for the case of unequal population variances. Test criteria for 
hypotheses of symmetry of a regression matrix were subsequently given 
by Chand (12). 

Advantages of computing orthogonal predictors were pointed out by 
Kimball (46). Smith (68) showed that concentration of the observations 
near the extremes of the practicable range produced the greatest efficiency 
in determining regression by the use of the method of orthogonal poly- 
nomials. 

Auto-regressive models have been used extensively by economists. A 
summary of recent research dealing with the problem of testing for the 
existence of auto-correlated errors in regression models was given by An- 
derson (2). Orcutt and Cochrane (58) investigated the merits of trans- 
formation in regression analysis when error terms are auto-correlated and 
where more than one relationship between the variables exists. Tests were 
described by Watson and Durbin (76) for examining the presence of serial 
correlation in least squares regression. A general method of testing the 
significance of nonlinear regression was adapted to the case of exponential 
regression by Keeping (44). 
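
The kind of test for serial correlation in least-squares residuals men- 
tioned above can be illustrated with the ratio now usually associated 
with Durbin and Watson. That the paper cited uses exactly this form is an 
assumption, the function name is hypothetical, and the significance 
bounds needed for an actual test are not reproduced. 

```python
def serial_correlation_d(residuals):
    """Ratio of the sum of squared successive differences of the
    residuals to the sum of squared residuals.

    Values near 2 suggest no first-order serial correlation, values
    near 0 suggest positive and values near 4 negative serial
    correlation. Residuals are assumed to be in time order.
    """
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator
```

For the strictly alternating residuals 1, −1, 1, −1 the ratio is 3.0; 
for constant residuals it is zero. 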


Other Regression Models 


A complication in regression analysis arises when it is unjustifiable 
to assume that the independent variables are free from errors of measure- 
ment. The determination (or identifiability) of parameter values in such a 
linear equation has been treated by a number of writers, particularly with 
reference to problems in economics. Such problems seem prevalent in 
educational and psychological research. 

One method of fitting a straight line when both variables are subject 
to error was pointed out by Wald (74). A modification of Wald’s method 
having the advantage of greater accuracy was described and illustrated 
by Bartlett (5). Berkson (6) employed classical regression theory to ex- 
press a functional relationship between two variables, both subject to 
error. In the case of the independent variable, he made a distinction be- 
tween “controlled” and “uncontrolled” measurements. Geary (30) dis- 
cussed Berkson’s results and gave a theory of estimation and of tests of 
significance for the case of controlled experiments where the relationship 
between two variables was nonlinear. 
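
The grouping methods of Wald and of Bartlett can be sketched together, 
since they differ mainly in the number of groups formed. In the sketch 
the observations are ordered on x, the slope is taken from the means of 
the lowest and highest groups, and the line is passed through the grand 
means. The function name is hypothetical, and the sketch assumes that 
ordering on the observed x does not depend on the errors, a condition 
that must be examined in any application. 

```python
def grouped_line(xs, ys, groups=2):
    """Fit a straight line by the method of group means.

    groups=2 follows Wald's division into a lower and an upper half
    (a middle observation is dropped when the number is odd);
    groups=3 follows Bartlett's use of the two outer thirds.
    Returns (slope, intercept).
    """
    pairs = sorted(zip(xs, ys))  # order the observations on x
    k = len(pairs) // groups
    low, high = pairs[:k], pairs[-k:]

    def mean(values):
        return sum(values) / len(values)

    slope = (mean([y for _, y in high]) - mean([y for _, y in low])) / (
        mean([x for x, _ in high]) - mean([x for x, _ in low])
    )
    intercept = mean(ys) - slope * mean(xs)  # line through the grand means
    return slope, intercept
```

For error-free observations lying on y = 2x + 1, both groupings return 
a slope of 2 and an intercept of 1. 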

The problem of identifiability was examined by Reiersol (64) in two 
different two-variable models. In one model the errors were assumed to be 
jointly normally distributed, and in the other to be stochastically inde- 
pendent. In both cases necessary and sufficient conditions for identi- 
fiability were found. Tukey (71) also considered linear regression where 
both variates are subject to error. 

Kendall (45) discussed a number of obscurities which arise concerning 
the connection between regression, functional relationship, and structural 
equations in theoretical models containing a stochastic element. 


Correlation Theory 


Measures of association have long been of interest and value to educators 
since many research problems and problems of test analysis employ these 
tools. Primary emphasis in this review was not placed upon item selection 
technics since they have generally received close attention in other issues 
of the REVIEW. 

Gillman and Goode (32) described the estimation of the correlation 
coefficient of a bivariate normal distribution when X was truncated and Y 
was dichotomized. Maritz (51) showed that it was possible to estimate, by 
means of probit analysis, the r of a bivariate normal population when one 
of the variables was dichotomized. A formula for obtaining the correlation 
between a quantitative continuous variate and a discrete series was derived 
by Hsu (41). 

Brogden (8) related correlation coefficients to the selective efficiency 
of dichotomous predictors and continuous predictors at particular points 
of cut. Peters (61) proposed a new descriptive statistic related to the 
second order parabola in the same manner as the product-moment correla- 
tion is related to a straight line. du Mas (20) introduced a coefficient 
of profile similarity for quantifying the comparison of one profile with 
another. 

Cowden (15) discussed a partial correlation coefficient which was also 
a multiple correlation coefficient. Its relationship to other coefficients was 
explained and computational methods were suggested. Multiple correlation 
was estimated and interpreted by Naylor (56) thru use of a stereographic 
projection. 

Often the assumption of independence between sample values may be 
only approximately valid. Walsh (75) investigated the effect of this intra- 
class correlation on confidence coefficients of several well-known signifi- 
cance tests. As a means of reducing environmental disturbances in genet- 
ical investigations, Smith (67) proposed a weighted combination of 
observations which maximized an intra-class correlation. Fieller and 
Smith (24) discussed the relationship between intra-class correlation and 
the analysis of variance. 
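
The relationship between intra-class correlation and the analysis of vari- 
ance can be illustrated with the familiar estimate from a one-way clas- 
sification with classes of equal size. The sketch is not taken from the 
papers cited; the function name and the data layout are hypothetical. 

```python
def intraclass_correlation(classes):
    """Intra-class correlation from a one-way analysis of variance.

    classes is a list of equal-sized lists of scores, one per class.
    The estimate is (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB and
    MSW are the between-class and within-class mean squares and k is
    the number of scores in each class.
    """
    n = len(classes)
    k = len(classes[0])
    grand = sum(sum(c) for c in classes) / (n * k)
    means = [sum(c) / k for c in classes]
    ss_between = k * sum((m - grand) ** 2 for m in means)
    ss_within = sum((v - m) ** 2 for c, m in zip(classes, means) for v in c)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

When the scores within every class are identical the within-class mean 
square vanishes and the estimate is unity. 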

Moran (55) considered some of the problems which arise in the study 
of partial and multiple rank correlation. Questions of distribution in the 
theory of rank correlation were considered by David, Kendall, and Stuart 
(17). Lyerly (49) derived a method of finding the average rank correla- 
tion without computing individual r’s. 

Tests of significance in canonical analysis were given by Marriott (52). 
McHugh (50) assumed two variables from a normal multivariate population 
in describing a method of comparing two correlated sample variances. 


Product-Moment Correlation Coefficient 


A recent paper by Hotelling (40) presented the derivation of the dis- 
tribution of r in a new form with special consideration of certain series 
and integrals. Gayen (29) studied the theoretical distribution of the prod- 
uct-moment correlation coefficient in random samples of any size drawn 
from non-normal universes. In a special case of biological importance, 
Haldane (36) presented an expression for the effect of a non-normal 
distribution on the precision of the estimate of the correlation coefficient 
in the population. 

Pilliner (62) showed that when two variables had equal variances in a 
population, the estimate of the correlation obtained from the sample by the 
product-moment method was an overestimate. The best sample estimate of 
the correlation was that obtained from an analysis of variance. Wolbers 
(80) devised a technic based on Wald’s sequential analysis for showing 
the correlational relationships between variables as a sample of cases 
is accumulated. 

Bittner and Wilder (7) explained how correlation coefficients could be 
cast as expectancy tables showing the percent of persons who make a 
specific score that would be expected to equal or exceed a given score on 
the associated variable. Tables for this purpose were prepared by Jackson 
and Phillips (42). Odell (57) gave empirical evidence that the contingency 
and product-moment coefficients could not be considered even approxi- 
mately equivalent. 
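
The conversion of a correlation coefficient into an expectancy table can 
be sketched under the assumption of a bivariate normal distribution with 
both variables in standard-score form. This is an illustrative sketch 
only; it is not claimed to be the procedure by which the published tables 
were prepared, and the function names are hypothetical. 

```python
from math import sqrt
from statistics import NormalDist


def expectancy(r, x, y):
    """Proportion of persons with standard score x on one variable who
    are expected to equal or exceed standard score y on the other.

    Assumes a bivariate normal distribution with correlation r,
    where -1 < r < 1.
    """
    conditional = NormalDist(mu=r * x, sigma=sqrt(1.0 - r * r))
    return 1.0 - conditional.cdf(y)


def expectancy_table(r, x_levels, y_cut):
    """Percent expected to reach y_cut at each level of x."""
    return {x: round(100 * expectancy(r, x, y_cut)) for x in x_levels}
```

With a correlation of zero, half of the persons at every score level are 
expected to equal or exceed the mean of the associated variable; as the 
correlation rises, the percents at high and low score levels draw apart. 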


Serial Correlation 


Dudek (19) found that the higher the correlation, the less the dis- 
crepancy between the biserial and product-moment coefficients. Wherry 
and Taylor (79) showed point-multiserial r and multiserial eta to be 
identical when the number of categories is two, but different for three or 
more categories. Multiserial eta was also shown to be identical to the 
product-moment r when the categories were assigned certain scale values. 
Vaswani (73) gave an example bearing on the use of tetrachoric r as a 
“metrical” measure of association between qualitative characters. 

Michael, Hertzka, and Perry (53), Michael, Perry, and Guilford (54), 
and Perry and Michael (60) considered the problem of estimating other 
coefficients of correlation from the phi coefficient. Hertzka, Michael, and 
Perry (39) discussed systematic errors in estimates of a phi coefficient. 


Computational Methods 


While the theoretical aspects of least squares analysis have been inves- 
tigated since the beginnings of statistical science, certain technics have 
emerged as particularly useful in the solution of these problems. Dwyer’s 
book (23), designed primarily for the individual worker using a desk 
calculator, covered basic linear computations. Topics included are deter- 
minants, matrix algebra, errors in linear computations, and applications 
of the methods to statistics. 

Burt (9) also gave numerical methods for the solution of linear equa- 
tions. Fox (27) described direct procedures for the solution of linear 
equations and the inversion of matrices. Van Boven (72) presented a 
modified pivotal condensation method for obtaining partial regression 
weights and multiple correlation coefficients. Some uses of pivotal conden- 
sation in statistical analysis were given by Collier (14). Goulden (34) 
included a discussion of appropriate computational methods in his recent 
book. 

A square root method of selecting a minimum set of variables in multiple 
regression was described and illustrated by Summerfield and Lubin (69). 
The computations were more compactly arranged than those of the Doo- 
little procedure and provided an F-ratio criterion for the selection of 
variables. Normal equations for obtaining regression weights in the pre- 
diction of a dichotomized criterion were presented by Wherry (78). 

Procedures for deleting, replacing, and adding independent variables 
subsequent to the solution of the normal system were described by West 
(77). Kossack (47) presented a method for computing the zero-order 
correlation coefficients for a correlation matrix by using two matrices, 
a summation matrix, and a computational matrix. 

Methods of solving simultaneous linear equations with Hollerith machines 
were described by Fox, Huskey, and Wilkinson (28), Hartley (37), and 
Healy and Dyke (38). Rummel (65) presented simplified procedures for 
computing zero-order correlations among several variables. Perloff (59) 
using zero-order coefficients alone worked out a tabular method for obtain- 
ing partial regression coefficients for four and fewer variables. 

Chown and Moran (13) suggested the use of a coefficient of the ranking 
type as an efficient estimator of the product-moment r. Short methods for 
calculating correlation coefficients examined by Flanagan (25) were found 
to furnish estimates of population values as accurate as those furnished 
by conventional computation. 

Gengerelli (31) proposed a simplified method for approximating partial 
regression coefficients. The method, one of exhaustion, dispensed with the 
solution of simultaneous equations. A procedure for rapid calculation 
of the slope of a regression line was suggested by Aldridge, Berry, and 
Davies (1). A graphic solution of multiple and partial correlation coef- 
ficients for three-variable problems was proposed by Davidoff (18). A 
short-cut method for computing multiple R was given by Jenkins (43) in 
which a table was used to obtain close approximations of beta coefficients. 
Foote (26) detailed the mathematical basis for a method of graphic multiple 
correlation. 


A graphical method for computation of biserial and point biserial corre-
lation was developed by Goheen and Davidoff (33). Siegel and Cureton 
(66) described a method for computing biserial r in terms of an analysis 
using several criteria. Suggestions were given by Guilford (35) for esti- 
mating point-biserial r between item and total score, and estimating a 
corrected point-biserial r directly from item data. 
CHAPTER IV 


Discriminant Analysis 


MAURICE M. TATSUOKA and DAVID V. TIEDEMAN 


Previous triennial reviews of statistical methods in educational research 
published in this journal have considered discriminant functions within 
the broader context of multivariate analysis (14, 52, 54). During the past 
few years, however, the potential value of the discriminant function in 
psychological and educational research has received considerable emphasis 
in the literature. Consequently, this issue isolates discriminant analysis 
as a separate topic and provides a summary and integration of the literature 
on this method. 

In educational and psychological research investigators frequently record 
a series of p observations for each of a number of individuals who may 
be classified into one of two or more mutually exclusive groups. For in- 
stance, three measures of cooking competence might be obtained for each 
pupil taught by either of two different cooking methods; or the eight 
Differential Aptitude Test scores of each member of a ninth-grade class 
might be distributed into one of four curriculum groups according to the 
curriculum which each student presented for high-school graduation. 
With data such as these, we first ascertain whether or not there is a stable 
difference in the p observations among the k populations of which the k 
sets of observations are samples. For those data in which a stable differ- 
ence exists, interest then turns to: (a) the distances separating pairs of 
the k groups, (b) the directions in which the k groups differ, and (c) the 
assignment to one of the k groups of an unclassified individual known to 
belong to one of the k groups. Significance, distance, direction, and as-
signment are the issues of discriminant analysis. The linear discriminant 
function is one of the tools available for study of these issues. As will be 
shown, a number of technics of discriminant analysis can be integrated 
thru the theory of the linear discriminant function. 


The Two-Group Discriminant Function 


In 1936, Fisher (34) considered the problem of getting a linear combina- 
tion of p variables which would, better than any other linear combination, 
discriminate between two chosen groups. By “better discrimination” he 
meant, specifically, that the ratio of the among-groups sum-of-squares 
of this linear function to its within-groups sum-of-squares (hereafter called 
the discriminant criterion) would have a larger value than that for any 
other linear function of the same variables. He found that the combining 
weights yielding this optimum linear combination, which he termed the 
discriminant function, are the elements of the column vector v satisfying 
this matrix equation: Wv = dk. Here W is a square matrix whose ele-
ments are the sums-of-squares and the sums-of-cross-products, within the 
two groups, of the p original variables; d is the column vector of the dif- 
ferences between the group-means on the p variables; and k is an arbitrary 
constant, since the discriminant criterion (being a ratio of two sums-of- 
squares) requires the weights to be determined only up to proportionality. 
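
The equation Wv = dk can be illustrated with a small numerical sketch. The code below is only an illustration under stated assumptions: the data are hypothetical, the helper names (`solve`, `group_mean`, `within_sscp`, `discriminant_weights`) are invented here, and the arbitrary constant is taken as k = 1. It is not the computing routine of any author cited.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix a is singular.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def group_mean(rows):
    """Mean of each of the p variables within one group."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def within_sscp(groups):
    """Pooled within-groups sums-of-squares and cross-products matrix W."""
    p = len(groups[0][0])
    w = [[0.0] * p for _ in range(p)]
    for rows in groups:
        m = group_mean(rows)
        for r in rows:
            dev = [r[j] - m[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    w[i][j] += dev[i] * dev[j]
    return w


def discriminant_weights(g1, g2):
    """Weights v satisfying W v = d (arbitrary constant k taken as 1)."""
    d = [a - b for a, b in zip(group_mean(g1), group_mean(g2))]
    return solve(within_sscp([g1, g2]), d)


# Hypothetical scores on p = 2 variables for two small groups.
group1 = [[4.0, 2.0], [5.0, 3.0], [6.0, 2.5], [5.0, 2.5]]
group2 = [[2.0, 3.0], [3.0, 4.0], [2.5, 3.5], [2.5, 3.5]]
v = discriminant_weights(group1, group2)
print(v)
```

Because the weights are determined only up to proportionality, any nonzero multiple of `v` yields the same value of the discriminant criterion.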

Fisher proposed the discriminant function as a solution to the problem 
of using information from a number of correlated variables to classify an 
unclassified object into one of two groups to which it must belong. He 
early pointed out the similarity of the discriminant function to other 
existing multivariate methods for treating two-group classification prob-
lems and other related problems. In his 1938 paper (33) he demonstrated that 
the discriminant function is proportional to the point biserial multiple 
regression function predicting a “dummy” criterion variable indicating 
group-membership. He also showed that the difference between the group-
means on the discriminant function is proportional to Hotelling’s T² (44), 
which is a generalization of Student’s t-statistic to multivariate cases, and 
is also proportional to Mahalanobis’ D² (62), which is a measure of the 
“distance” between two groups. Because of the relationship of the dis-
criminant function to Hotelling’s T², Fisher was able to give a firm theo-
retical basis to his 1936 z-test (or the now more common F-test) for the 
null hypothesis that the multivariate centroids in the two populations coin-
cide. Further, its relationship to the generalized distance D² enables one 
to use the discriminant function in studies of distances between pairs 
of group-centroids. Finally, the proportionality to the point biserial multiple 
regression function implies a mathematical relation between the value of 
the discriminant criterion (for the discriminant function) and that of the 
multiple correlation coefficient. 
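
The proportionalities described in this paragraph can be written out explicitly. What follows is a sketch in present-day notation rather than Fisher's own. It assumes samples of sizes N_1 and N_2 (symbols introduced here for illustration), takes the arbitrary constant as k = 1 so that v = W⁻¹d, writes S = W/(N_1 + N_2 - 2) for the pooled estimate of the common dispersion, and lets R denote the multiple correlation of the "dummy" criterion with the p variables.

```latex
\begin{align*}
D^2 &= d' S^{-1} d = (N_1 + N_2 - 2)\, d' W^{-1} d, \\
v'd &= d' W^{-1} d = \frac{D^2}{N_1 + N_2 - 2}, \\
T^2 &= \frac{N_1 N_2}{N_1 + N_2}\, D^2, \\
\frac{\text{among-groups SS of } v'x}{\text{within-groups SS of } v'x}
    &= \frac{N_1 N_2}{N_1 + N_2}\, d' W^{-1} d
     = \frac{T^2}{N_1 + N_2 - 2}
     = \frac{R^2}{1 - R^2}.
\end{align*}
```

The second line is the difference between the group-means on the discriminant function, and the last line is the discriminant criterion evaluated for that function.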

A year later, Welch (110) published a short note treating the classi- 
fication problem from the standpoint of maximizing the probability of 
correct classification, and showed that Fisher’s discriminant function was 
the optimum one (in the above probability-sense) when the population 
distributions of the p variables were multivariate normal with equal vari- 
ances and covariances (hereafter, briefly, dispersions). Thus, Fisher’s 
apparently intuitive use of the discriminant criterion in connection with 
classificatory problems received justification in terms of what is the most 
natural criterion in such problems, high probability of correct classifica- 
tion, at least under those conditions usually satisfied by the sorts of data 
with which he was primarily concerned. 

Altho Fisher’s 1938 paper integrated the discriminant function with 
other multivariate methods for two-group problems and Welch’s 1939 paper 
established its validity for a special case of the classification problem, 
Fisher carefully noted important differences among the intents of these 
methods. Central to all further issues in the study of multivariate differ- 
ences between groups is the establishment of difference between the multi- 
variate populations of which the two sets of observations are samples. If 
the two populations are multivariate normal with equal dispersions, the 
only issue of difference is whether or not the centroids of the two popu-
lations are situated at the same point in the multivariate space. Hotelling’s 
T²-test provides an exact method for answering this question. If the an-
swer is in the negative, three further questions become pertinent: (a) What 
is the distance separating the two groups? (This is the question with which 
Mahalanobis and others of the Indian school were primarily concerned.) 
(b) In what direction do the two groups differ? (This is the question 
that concerned Fisher in 1938; it is appropriately answered by getting 
the discriminant function.) (c) How may an unclassified individual known 
to belong to one of the two groups best be assigned to a group? (This is 
a question which originally concerned Fisher, but which is best pursued 
in terms of the Neyman-Pearson likelihood-ratio principle, as will be noted 
later.) 

In the special case of two multivariate normal populations with equal 
dispersions, it will be noted that the linear discriminant function spans 
the field, so to speak. While, essentially, it estimates only the direction 
of the difference between populations, it can be used in studying the other 
questions as well, by virtue of its relationship with T², D², and the likeli-
hood-ratio solution to the classification problem. This versatility of the 
discriminant function quickly appealed to psychologists, and by 1938, 
Wallace and Travers (109) in the study of specialty salesmen made its 
first application to psychometric data. Travers continued his interest in 
the linear discriminant function, and introduced it to the American litera- 
ture in 1939 (103). 

The reception accorded this new technic by American psychologists 
was ambivalent. To be sure, the discriminant function was described and 
discussed in several early papers such as those by Garrett (35) and by 
Lorge (60). The main emphasis in these papers, however, seemed to be to 
point out that the discriminant function was proportional to the point 
biserial multiple regression function, and that hence (by implication at 
least) the new technic was nothing more than a special case of the already 
familiar regression analysis. This tendency was accentuated by Wherry’s 
1947 paper (112) which showed that the same proportionality of com- 
bining weights holds when biserial instead of point biserial correlations 
are used in the normal equations for getting the partial regression weights. 
What Wherry failed to recognize was the distinction among questions of 
relationship, distance, and direction, which Fisher had early noted. Wherry 
consequently recommended that the use of discriminant functions be aban-
doned “in favor of the easier and better understood correlation approach” 
(112: 190). 


Nevertheless, the discriminant function was used by some psychologists 
and educators in the study of their problems. Selover (89) used dis- 
criminant analysis to study differences between sophomore test scores 
of graduate concentrators in various fields. Kuder (58) developed the M-F 
Scale and the Accountant-Auditor Scale for his Preference Record by com- 
puting discriminant functions for the respective pairs of groups contrasted. 
Baten and Hatcher (12) differentiated between groups of students taught 
cooking by two different methods, using a discriminant function based 
on three measures of cooking ability. Harper (37) used the technic for 
classifying individuals into normal and schizophrenic groups on the basis 
of Wechsler-Bellevue subtest scores. 

Interest in the discriminant function is further indicated by the number 
of studies presenting approximate solutions. The exact solution becomes 
laborious when the number of variables is extremely large, tho it is by no 
means impracticable today when fairly rapid procedures for matrix inver- 
sion are available. Beall (13) and Jackson (50) proposed several approxi- 
mate methods of solution which depend on an assumption that the correla- 
tions between pairs of variates are all equal. Penrose (72) showed that 
under this assumption (plus the customary one that the dispersions in the 
two populations are equal) two particular linear combinations, which he 
called the “size” and “shape” factors, could be constructed so that the 
discriminant function based on these instead of on the original p variables 
is exactly the same as that which would be obtained with the latter. Horst 
and Smith (43) illustrated a rapid iterative method for computing two- 
group discriminant weights via the multiple regression approach. 

Standard errors of the discriminant function coefficients were obtained 
by Bartlett (7). Fisher (31) solved the more general problem of testing 
whether an obtained discriminant function is significantly different from 
an hypothesized one advanced either on theoretical grounds or from con- 
siderations of simplicity. 

An interesting modification of the discriminant function was proposed 
by Cochran and Bliss (24). They pointed out that in certain cases variables 
which in themselves do not significantly differentiate between two groups 
may nevertheless enhance the discriminatory power of other variables, 
when the latter are replaced by deviations from their respective within-
groups regressions on the former (which are called covariance variables). 

In view of the proportionality between the discriminant function and the 
point biserial or biserial multiple regression function, the distinction be- 
tween them might seem to be merely a matter of nomenclature and slightly 
different computational routines, in which case the explicit or implicit rec- 
ommendations made by Wherry and others to abandon the term “dis- 
criminant function” and the computational procedure associated with it, 
would be justifiable on the grounds of parsimony. But proportionality 
obtains only if one is interested in comparing just two groups at a time. 
As soon as one is concerned with studying three or more groups simul- 
taneously, the proportionality in question no longer holds for the simple 
reason that it then becomes impossible to consider a linear multiple regres- 
sion function save in the highly exceptional case when the centroids of the 
several populations are all collinear. Bryan (19), Rulon (86, 87), and 
Tiedeman (95) made this point. 

Any statistical technic restricted to the comparison of two groups at a 
time is severely limited in its usefulness. Consequently, as soon as a method 
becomes available for the study of groups in pairs, there are immediate 
attempts to develop its analog without restriction as to the number of 
groups. Thus, multivariate tests of significance, the generalized distance 
measure, the linear discriminant function, and the likelihood-ratio solution 
to classification problems have all been appropriately generalized to deal 
with multigroup problems. 

Despite relationships later noted among these generalizations, each de-
velopment proceeded largely without reference to any other. Consequently, 
it is convenient first to record each generalization separately, and only 
then to note their relationships. In reviewing these developments we will 
necessarily confine ourselves to those aspects bearing most directly upon 
the topic of discriminant analysis. 


Multivariate Tests of Significance 





Given observations on p variables for two samples, of sizes N_1 and N_2 re-
spectively, drawn from two populations in which the joint distributions 
of the p variables are multivariate normal with equal dispersions, we want 
to know whether or not it is reasonable to infer that the two population 
means on each of the variables are equal, or, in geometric language, that 
the two centroids coincide in the p-dimensional variate-space. This was 
the type of problem that led Hotelling (44) in 1931 to devise his T²-
statistic and study its sampling distribution. He noted that it is not appro- 
priate to perform p separate t-tests; for we may get conflicting results— 
some indicating significant differences, others not—and we would not be 
able to combine the different probabilities into a single over-all probability 
since the t-values are not experimentally independent. Guided by a general 
principle of considering invariants, or quantities which remain unchanged 
under rotation of the axis-system used in representing the observations 
geometrically, Hotelling derived the appropriate statistic for this problem, 
and also gave its sampling distribution explicitly. 
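
As a rough numerical illustration of the two-sample problem described above, the sketch below computes T² from two small samples and converts it to an F-ratio. The data and function names are hypothetical. The F conversion, with p and N_1 + N_2 - p - 1 degrees of freedom, is the form given in standard textbooks and is not quoted from Hotelling's paper.

```python
def mean(rows):
    """Mean of each variable within one sample."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sscp(rows):
    """Sums of squares and cross-products of deviations from the sample mean."""
    m = mean(rows)
    p = len(m)
    return [[sum((r[i] - m[i]) * (r[j] - m[j]) for r in rows)
             for j in range(p)] for i in range(p)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hotelling_t2(g1, g2):
    """Return (T2, F, (df1, df2)) for two samples assumed to share a dispersion."""
    n1, n2, p = len(g1), len(g2), len(g1[0])
    w1, w2 = sscp(g1), sscp(g2)
    # Pooled dispersion estimate S = W / (N1 + N2 - 2).
    s = [[(w1[i][j] + w2[i][j]) / (n1 + n2 - 2) for j in range(p)]
         for i in range(p)]
    d = [a - b for a, b in zip(mean(g1), mean(g2))]
    # T2 = N1 N2 / (N1 + N2) * d' S^{-1} d
    t2 = n1 * n2 / (n1 + n2) * sum(di * xi for di, xi in zip(d, solve(s, d)))
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f, (p, n1 + n2 - p - 1)


# Hypothetical scores on p = 2 variables for two samples of four cases each.
sample1 = [[4.0, 2.0], [5.0, 3.0], [6.0, 2.5], [5.0, 2.5]]
sample2 = [[2.0, 3.0], [3.0, 4.0], [2.5, 3.5], [2.5, 3.5]]
t2, f, df = hotelling_t2(sample1, sample2)
print(t2, f, df)
```

With a single variable (p = 1) the statistic reduces to the square of Student's t, which is one way of checking such a routine.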

One year later, Wilks (113) solved the multi-group extension of this 
problem by following the likelihood-ratio principle of Neyman and Pearson 
(68). He thus provided a test for the over-all significance of the dispari-
ties among the group-centroids when k (the number of groups) is more 
than 2. Wilks’s statistic, which reduces to a simple function of Hotelling’s 
T² in the special case of two groups, is defined as the following ratio 
of two determinants: 

$$\Lambda = \frac{|W|}{|W + A|},$$
where W is the k-group extension of the W previously defined, and A is the 
among-groups sums-of-squares and sums-of-cross-products matrix. The exact 
distribution of Λ is known only for certain special values of p and k. How-
ever, Bartlett (8) provided an approximate large-sample test of significance. 
Both Hotelling’s T²-test and the Wilks-Bartlett Λ test require that the 
dispersions in the populations tested be equal. Consequently, a test for the 
homogeneity of the several dispersions should be made before either of 
these tests is applied in practice. The appropriate criterion for this pur-
pose was given by Wilks (113) in the same paper in which he derived the 
Λ criterion for the disparity of centroids; but its sampling distribution 
has not been reduced to a usable form. A statistic with a somewhat less 
complicated distribution and applicable to the case of two samples only, 
was derived by Roy (85); but even this is of such complexity as to pre- 
clude its practical application unless the number of variates is quite small. 
Recently, Pillai (73) announced his derivation of a more usable expression 
for the distribution studied by Roy, with exact cumulative distribution 
given for any number of variables up thru eight. Since it is difficult to 
test the homogeneity of dispersions, it has been customary to assume 
equality of the population dispersions when the main interest lies in the 
disparity among centroids. 
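
The Λ criterion and the large-sample approximation mentioned above may be sketched numerically as follows. The data and function names are hypothetical. The chi-square statistic is the commonly cited form of Bartlett's approximation, -[N - 1 - (p + k)/2] ln Λ with p(k-1) degrees of freedom, and is given here as an assumption rather than as a quotation from reference (8). The code uses the identity that the total sums-of-squares and cross-products matrix equals W + A.

```python
import math


def mean(rows):
    """Mean of each variable over the given rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sscp_about(rows, center):
    """Sums of squares and cross-products of deviations about a given point."""
    p = len(center)
    return [[sum((r[i] - center[i]) * (r[j] - center[j]) for r in rows)
             for j in range(p)] for i in range(p)]


def det(a):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in a]
    n = len(m)
    d = 1.0
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[piv][col] == 0.0:
            return 0.0
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            d = -d  # a row interchange changes the sign
        d *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return d


def wilks_lambda(groups):
    """Lambda = |W| / |W + A|, with W + A computed as the total SSCP matrix."""
    p = len(groups[0][0])
    w = [[0.0] * p for _ in range(p)]
    for g in groups:
        s = sscp_about(g, mean(g))
        for i in range(p):
            for j in range(p):
                w[i][j] += s[i][j]
    pooled = [r for g in groups for r in g]
    total = sscp_about(pooled, mean(pooled))
    return det(w) / det(total)


def bartlett_chi2(lam, n_total, p, k):
    """Approximate chi-square for Lambda and its degrees of freedom p(k-1)."""
    v = -(n_total - 1 - (p + k) / 2) * math.log(lam)
    return v, p * (k - 1)


# Hypothetical scores on p = 2 variables for k = 3 groups of four cases each.
groups = [
    [[4.0, 2.0], [5.0, 3.0], [6.0, 2.5], [5.0, 2.5]],
    [[2.0, 3.0], [3.0, 4.0], [2.5, 3.5], [2.5, 3.5]],
    [[6.0, 5.0], [7.0, 6.0], [6.5, 5.0], [7.0, 5.5]],
]
lam = wilks_lambda(groups)
chi2, df = bartlett_chi2(lam, n_total=12, p=2, k=3)
print(lam, chi2, df)
```

Small values of Λ (and hence large values of the approximate chi-square) indicate disparity among the group-centroids.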


The Generalized Distance 


Two of the earliest works which treated taxonomic problems by statistical 
methods were those by Morant (66) and Tildesley (99), both of whom 
used an index, “the coefficient of racial likeness” proposed by Karl Pearson, 
for classifying skeletal remains into racial groups on the basis of a number 
of anthropometric measurements. The coefficient of racial likeness takes 
no account of the intercorrelations among the several variables and is 
considerably affected by the size of the samples available for study. Ma- 
halanobis’ generalized distance eliminated these deficiencies (62). The 
theory and application of this statistic have been treated extensively in the 
Indian literature. 

While a comparison shows that D² and T² are proportional to each 
other, the distinctive feature of D², as Mahalanobis (62) pointed out, is 
that it is a measure of distance rather than a criterion for testing the 
hypothesis of zero-distance. Consequently, the primary interest here is to 
establish confidence limits for the parameter (that is, the distance between 
the population centroids, or “true” distance) being estimated by the D²-
statistic. Bose and Roy (15) and Roy (84) studied the noncentral D²-
distribution, which is important for this purpose. Here the usual central 
D²-distribution is not appropriate as it would be based on the assumption 
that the true distance is zero, an assumption unjustified in most cases of 
real interest. 

Another relevant question in the study of group-distance is that of 
whether the distance is significantly increased by the addition of, say q 
further variables beyond the original p. Rao (76) gave an exact test for 
answering this question, using a statistic defined as a function of two 
D²s, one based on all p+q variates and the other on the first p variates 
alone. For large samples, an approximate test, also given by Rao (75), is 
available which involves simply the difference, D²_{p+q} − D²_p, of the two 
distance statistics. 
A particularly fruitful application of the D² approach to the study of 
groups was made in the extensive anthropological survey conducted by 
Mahalanobis, Majumdar, and Rao (63) in the United Provinces in India. 
Briefly, they computed D², based on nine anthropometric measures, be-
tween all possible pairs among 12 different castes and tribes, and then 
determined five clusters of such groups, on the criterion of having large 
values of D² between groups belonging to different clusters and a small 
value for any pair of groups within each cluster. 

Not only was this anthropological survey of practical significance; it 
also stimulated a number of theoretical developments, prominent among 
which was a method for representing the group-centroids in a space of fewer 
dimensions than the original p. The procedure, developed by Rao (82), 
consists in seeking the particular set of t (<p) linear combinations of the 
original variables which yields the maximum average-D² (over all pairs 
of groups) possible with that number of linear combinations. Rao found 
that the desired sets of combining weights are given by the several vector 
solutions, v_i (i = 1, 2, ..., t), of the matrix equation (A − λW)v = 0, 
where A and W are as defined before, and λ is a Lagrangian multiplier. 
By this method, a transition is effected from the essentially two-group con- 
cept of distance to a multi-group concept of configuration. Distance, being 
a scalar quantity, does not exhaust the information in a comparison among 
three or more groups unless the latter happen to be collinear; configura- 
tion includes the idea of direction, about which more will be said later. 

This technic of reduction of dimensionality, sometimes known as canon- 
ical reduction, can be regarded, in a broad sense, as a generalization of the 
D²-approach. A direct generalization of the D²-statistic itself was made 
by Rao (80). He also showed that the sampling distribution of his statistic 
for large samples is approximated by the chi-square distribution with 
p(k-1) degrees of freedom. These results, in turn, play an important role 
in the canonical reduction technic; namely, they offer a basis for finding 
the number, t, of linear combinations needed to exhaust the statistically 
significant information concerning group-configuration inherent in the 
original data, when these linear combinations are obtained by canonical 
reduction. This dimensionality-test, as well as another, slightly more ac-
curate one (which is based on the Wilks-Bartlett Λ-test noted earlier), 
was described by Rao and Slater (83) in a paper which illustrated the 
canonical reduction technic thru an example involving psychological data 
for five neurotic groups and a normal (control) group of British Army 
officers. 


Discriminant Functions for More Than Two Groups 


Problems involving multi-group comparisons were treated quite early by 
means of discriminant functions. Barnard (3) studied the changes with time 
in four series of Egyptian skulls by means of that linear function of four 
anthropometric indexes which maximized the discriminant criterion. Day 
and Sandomire (28) computed discriminant functions based on several 
selected combinations of a number of biometric measurements for differen- 
tiating among four age-groups of deer. In both these papers, the fact that 
the groups studied had a natural ordering in time was utilized to modify 
the discriminant criterion; the among-groups regression of each variate 
on time was used in place of the unrestricted sum-of-squares. For this 
procedure to be valid, not only must there be a variable which orders 
the groups, but also the regression of each of the p variables on the ordering 
variable must be linear, which is the same thing as requiring the several 
group-centroids to be collinear in the p-dimensional space. 

Fisher (32), Johnson (53), and Mather (64) each described a method 
for computing for more than two groups a discriminant function that does 
not require an a priori ordering variable. In this case, the condition that 
the discriminant criterion be maximized yields precisely the same equa-
tion that Rao arrived at in his canonical reduction procedure. Now, how-
ever, λ stands for the value of the discriminant criterion for the linear 
combination specified by the corresponding vector solution v. The vector 
solution v_1 corresponding to the largest root λ_1 of the determinantal equa-
tion |A − λW| = 0 gives the required set of discriminant function weights. 
This vector is also known as the latent vector of the matrix W⁻¹A cor-
responding to the latter’s largest latent root, as λ_1 is called. 
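
One way to obtain the largest latent root and its vector numerically is simple power iteration on W⁻¹A, sketched below. This is an illustrative technique chosen here, not the computing routine of any author cited; the data and function names are hypothetical. The root is recovered as the ratio v'Av / v'Wv, which is the discriminant criterion for the linear combination with weights v.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def within_and_among(groups):
    """Return the within-groups matrix W and the among-groups matrix A."""
    p = len(groups[0][0])
    grand = mean([r for g in groups for r in g])
    w = [[0.0] * p for _ in range(p)]
    a = [[0.0] * p for _ in range(p)]
    for g in groups:
        m = mean(g)
        dm = [m[j] - grand[j] for j in range(p)]
        for i in range(p):
            for j in range(p):
                w[i][j] += sum((r[i] - m[i]) * (r[j] - m[j]) for r in g)
                a[i][j] += len(g) * dm[i] * dm[j]
    return w, a


def matvec(m, x):
    return [sum(mi * xi for mi, xi in zip(row, x)) for row in m]


def quad(x, m):
    """Quadratic form x' m x."""
    return sum(xi * yi for xi, yi in zip(x, matvec(m, x)))


def first_discriminant(groups, iterations=200):
    """Largest root of |A - lambda W| = 0 and its vector, by power iteration.

    Assumes W is nonsingular and that the all-ones starting vector is not
    annihilated by A; a different start would be needed otherwise.
    """
    w, a = within_and_among(groups)
    x = [1.0] * len(w)
    for _ in range(iterations):
        x = solve(w, matvec(a, x))       # x <- W^{-1} A x
        scale = max(abs(c) for c in x)
        x = [c / scale for c in x]       # rescale to avoid overflow
    return quad(x, a) / quad(x, w), x


# Hypothetical scores on p = 2 variables for k = 3 groups.
groups = [
    [[4.0, 2.0], [5.0, 3.0], [6.0, 2.5], [5.0, 2.5]],
    [[2.0, 3.0], [3.0, 4.0], [2.5, 3.5], [2.5, 3.5]],
    [[6.0, 5.0], [7.0, 6.0], [6.5, 5.0], [7.0, 5.5]],
]
root, weights = first_discriminant(groups)
print(root, weights)
```

Power iteration yields only the first function; the successive roots and vectors would require deflation or a full latent-root routine.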

While this method does not depend on having a natural ordering variable 
to start out with, it is clear that considering only one linear combination 
as the discriminant function in effect makes a linear ordering of the 
groups. Consequently, the discriminant function thus defined does not 
exhaust all the information in the data relevant to group-separation, except 
in the rare case when the population centroids are in fact collinear. Recog- 
nizing this, Fisher (33, 34) devised a test of collinearity. A related but 
more general test-criterion and its distribution were subsequently studied 
by Tintner (100). These tests tell when to use the single discriminant 
function defined above without appreciable loss of information; but they 
do not indicate what is to be done when that function is inappropriate 
because the centroids cannot be assumed to be collinear. 

It is difficult to say when and by whom the appropriate extension of 
the discriminant function idea to situations involving three or more non- 
collinear groups was first accomplished. Fisher from the outset seemed to 
have such an extension in mind when he spoke of “the second canonical 
variate” (33). Guttman (36) and Tukey (104) referred to the successive 
vector solutions of a matrix equation of this form. However, no computed 
example of multi-group discriminant functions qua discriminant func- 
tions appeared in the literature until Bryan (19, 20) gave a reasonably 
workable computational routine for getting the successive latent roots and 
vectors of the matrix in question, and also showed that the linear combina- 
tions defined by these successive vectors have this property: the first 
linear combination (corresponding to the largest latent root) maximizes 
the discriminant criterion in Fisher’s original sense; the second linear 
combination (whose associated latent root is the second largest) maximizes 
the ratio of the residual among-groups sum-of-squares to the residual 
within-group sum-of-squares after the effect of the first linear combination 
has been removed; the third linear combination maximizes the ratio of the 
corresponding sums-of-squares after the effects of the first two have been 
removed; and so forth. Bryan called these successive linear combinations 
multiple discriminant functions. For k groups there are k-1 of these, except 
in the unusual case when the number of variables, p, is smaller than k-1, 
in which case there are p discriminants. 

While multiple discriminant functions were originally intended primarily 
to serve as tools in multi-group classification problems, they also enabled 
the study of directions of group-differences in precisely those cases of real 
interest (noncollinearity) which for the single discriminant function con- 
stituted “nuisance cases,” so to speak. In such studies the general purpose 
bears a certain resemblance to that of factor analysis (5) in the sense 
that a parsimonious description—and perhaps incidentally a “meaningful” 
one in terms of the particular field of research—is sought which satisfac- 
torily accounts for the inter-group variations by means of a smaller number 
of variables than the original p. This was illustrated by Tiedeman, Bryan, 
and Rulon (97), who obtained multiple discriminant functions, based on 
the 17 tests of the USAF Airman Classification Battery, for differentiating 
among eight Air Force specialty groups and interpreted the first two func- 
tions as “factors” representing certain psychological traits. 


Classification Problems 


Welch’s solution (110) to the classification problem, noted earlier, was 
essentially an application of the Neyman-Pearson likelihood ratio prin- 
ciple (68) for tests of statistical hypotheses. The solution takes the form 
of a partitioning of the p-dimensional sample space into as many regions 
as there are groups, with the rule to classify an individual into the ith 
group if his predictor score combination, treated as coordinates in the 
space, determines a point (the “observation point”) within the ith region. 
For the two-group problem, Welch’s solution is perfectly general with 
respect to the type of distribution of the predictor variables, provided only 
that the “true” distributions are completely known. He concluded that “the 
best function to discriminate between two completely specified popula- 
tions . . . is simply the ratio of the two probability laws, p₁/p₂” (110: 218). 
In other words, an individual is to be assigned to the first group if his 
observation point falls in the region in which p₁/p₂ is greater than some 
suitably chosen constant, and to the second group otherwise. This decision 
rule has the property of minimizing the probability of misclassification 
subject to the condition that the probability of misclassification is the 
same for both groups. A similar, but somewhat more general, problem was 
treated by von Mises (106): more general in number of groups, and also 
in that the condition of equal probabilities of misclassification was not 
arbitrarily imposed but followed as part of the solution. Smith (90) defined 
the discriminant function as the logarithm of the ratio of the probability 
densities in two groups and, like Welch, showed that this logarithm reduces 
to the linear discriminant function in the special case of two multivariate 
normal distributions with equal dispersions. When the dispersions cannot 
be assumed equal, the discriminant function thus defined is a quadratic 
function of the predictor variates. 
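
The reduction just described can be checked numerically. The following is a minimal sketch for two variables, assuming bivariate normal densities; the function names and the 2 x 2 tuple layout of the dispersion matrix are illustrative choices and are not taken from Welch or Smith.

```python
import math

def log_normal_density(x, mean, cov):
    """Log density of a bivariate normal; cov is ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form (x - mean)' cov^-1 (x - mean) for a 2 x 2 matrix.
    quad = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def log_ratio(x, mean1, cov1, mean2, cov2):
    """Logarithm of the ratio of the two group densities at x."""
    return log_normal_density(x, mean1, cov1) - log_normal_density(x, mean2, cov2)

def linear_discriminant(x, mean1, mean2, cov):
    """Linear form w'(x - midpoint), with w = cov^-1 (mean1 - mean2)."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    m0, m1 = mean1[0] - mean2[0], mean1[1] - mean2[1]
    w = ((d * m0 - b * m1) / det, (a * m1 - b * m0) / det)
    mid = ((mean1[0] + mean2[0]) / 2, (mean1[1] + mean2[1]) / 2)
    return w[0] * (x[0] - mid[0]) + w[1] * (x[1] - mid[1])
```

With a common dispersion matrix the two functions agree at every point. With unequal dispersions the log ratio keeps its second-degree terms, which is the quadratic case noted above.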

In a series of articles, Rao (77, 78, 81, 82) obtained much the same 
results as the foregoing authors, tho his treatment was more systematic 
and mathematically more rigorous. One new feature in Rao’s work was the 
introduction of a “doubtful region,” corresponding to suspending the deci- 
sion of classification. 

A classificatory criterion known as the centour score corresponding to 
an observation point was proposed by Tiedeman, Bryan, and Rulon (97). 
When the p variables are distributed in a multivariate normal manner in 
each of the k populations, the centour score expresses the percent of mem- 
bers of a given population outside the iso-frequency contour on which 
the observation point falls. Hence a natural classificatory procedure is to 
compute an individual’s centour score with respect to each of the k groups 
under consideration, and to assign him to that group for which his centour 
score is the highest. The main purpose is to permit the construction of a 
table which can be used by a practitioner not highly trained in mathe- 
matical statistics. Here multiple discriminant functions play an important 
role for two reasons. First, reducing from p to at most k-1 (when k-1<p) 
the number of variables on which the centour scores are based reduces the 
size of an otherwise unwieldy table to practicable dimensions. Second, 
being linear combinations of a number of variables, the discriminant func- 
tions will, in general, have joint distributions that are more nearly multi- 
variate normal than those of the original variables; this means that the 
centour scores based on the former will come closer to having the in- 
tended meaning noted above. 
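
As an illustration of the definition given above, the following sketch computes centour scores for the case of two discriminant functions, assuming bivariate normality within each group. For two variables the chi-square tail area has the closed form exp(-c/2), so no table is needed. The function names and the group parameters in the usage note are hypothetical and do not come from Tiedeman, Bryan, and Rulon.

```python
import math

def centour_score(point, centroid, cov):
    """Percent of a bivariate normal population lying outside the
    iso-frequency contour through `point`; cov is ((a, b), (b, d))."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = point[0] - centroid[0], point[1] - centroid[1]
    # Quadratic form of the deviation from the centroid.
    chi2 = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    # With 2 degrees of freedom, P(chi-square > c) = exp(-c / 2).
    return 100.0 * math.exp(-chi2 / 2.0)

def classify(point, groups):
    """Assign to the group with the highest centour score.
    groups maps a group name to (centroid, cov)."""
    scores = {name: centour_score(point, c, s) for name, (c, s) in groups.items()}
    return max(scores, key=scores.get), scores
```

A call such as `classify((0.5, 0.2), {"A": ((0.0, 0.0), eye), "B": ((3.0, 0.0), eye)})`, with `eye` an identity dispersion matrix, returns the chosen group together with the score for every group. A point at a group centroid receives a centour score of 100 for that group.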

Strictly speaking, all the foregoing solutions require knowledge of the 
several population distribution functions. While it is intuitively plausible 
that results which hold exactly when the population distributions are 
known should hold approximately when the parameters of the parent dis- 
tributions are estimated from samples, it was Hoel and Peterson (42) 
who first gave a mathematical proof that such indeed is the case, asymp- 
totically, for classification problems when maximum likelihood estimates 
of various parameters are used. This result gave firm theoretical footing 
to the foregoing work. 

There is an extensive literature on yet another approach to the classi- 
fication problem, developed primarily in clinical psychology, and known 
as methods of assessing profile-similarity. Cronbach and Gleser (27) made 
a comparative study of these various methods, and showed that the dif- 
ferent measures proposed by different authors for describing the degree 
of similarity between two persons (or the same person at two different 
times), between a person and a group, or between two groups, are all 
ultimately related to a distance-measure, D, proposed by them as well 
as by Osgood and Suci (69). In particular, Cattell’s (21) index, r_p, is a 
function of D which effects an order-reversing mapping of the values 
of D into the interval from —1 to 1, in order to transform D to values 
comparable to those of a correlation coefficient. Several other commonly 
used indexes, such as du Mas’s r_ps (29), and Stephenson’s Q correlation 
(91), were found by Cronbach and Gleser to have the limitation of re- 
moving the effects of the intra-individual mean and variance of the several 
profile points from the information conveyed by their D. In the opinion 
of the present reviewers, the measure D itself, which is essentially the same 
thing as Pearson’s coefficient of racial likeness (except that raw-score 
differences, rather than standardized measures, are used), suffers from 
much the same limitations as does that coefficient. In particular, ignoring 
possible correlation among the variables would seem to distort the distance- 
measure by treating partly overlapping pieces of information on an equal 
footing with independent ones, tho Cronbach and Gleser advance some 
arguments to justify this treatment. Other difficulties of the profile-similarity 
approach have been discussed by Rulon and others (88) and by Tiedeman 
(94). 
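
The reviewers' objection concerning correlated variables can be made concrete with a small sketch. It assumes two standardized scores per profile and compares D, computed as the root of the summed squared differences, with a generalized distance that allows for a correlation r between the two variables. The second function and the numerical values in the usage note are illustrative assumptions, not formulas quoted from the papers cited.

```python
import math

def profile_distance(x, y):
    """D: square root of the summed squared score differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def generalized_distance(x, y, r):
    """Distance between two standardized two-score profiles when the
    two variables correlate r (dispersion matrix ((1, r), (r, 1)))."""
    d1, d2 = x[0] - y[0], x[1] - y[1]
    return math.sqrt((d1 * d1 - 2 * r * d1 * d2 + d2 * d2) / (1 - r * r))
```

Taking the origin as one profile, the profiles (1, 1) and (1, -1) lie at the same D. With r = 0.8 the generalized distance is much smaller for (1, 1), whose two differences repeat largely the same information, than for (1, -1). When r = 0 the two measures coincide.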

Regardless of what approach one takes in solving a classificatory prob- 
lem, it is often desirable to test whether the classification thereby attained 
is significantly better than chance. A nonparametric test for this purpose 
was proposed by Lubin (61), who also suggested a statistic for testing 
whether one classificatory procedure is significantly better than another. 

Wald (107) introduced a new consideration into the classification prob- 
lem by taking into account the possibly different costs of different types 
of misclassifications. He defined the risk of an error as the cost multiplied 
by the probability of committing that error; and the partitioning of the 
sample space into classificatory regions was based on the principle of 
equalizing the risks associated with all groups. Anderson (1), Brown (17), 
and Rao (82) each incorporated the concept of risk in their general treat- 
ments of multi-group classification problems. Anderson pointed out that 
the optimum decision rule depends on whether or not one knows a priori 
the proportion of each population in the mixed population at large. The 
respective solutions are the so-called Bayes’ solution, and the minimax 
solution as discussed extensively in Wald’s theory (108) of decision func- 
tions (see Chapter VII of this issue). Important as this new approach 
is, it seems as yet to remain in the realm of theory, owing to the practical 
difficulty of determining the differential costs of misclassifications. Kos- 
sack (57) gave an illustrative example of Wald’s procedure, with full com- 
putational detail. However, the costs were all taken as equal, so that the 
solution was essentially that of minimum probability of misclassification. 
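
The role of costs can be illustrated for the case in which the a priori proportions are known (the Bayes solution; the minimax solution is not attempted here). The sketch below assumes univariate normal groups and assigns an individual to the group for which the expected cost of misclassification is least. The function names, the cost table layout, and all numerical values in the usage note are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def expected_costs(x, groups, cost):
    """Expected cost of each possible assignment of score x, up to a
    common positive factor (the mixed density at x is not divided out).
    groups: {name: (prior, mean, sd)}
    cost[(true, assigned)]: cost of placing a member of `true` in
    `assigned`; missing entries, including correct ones, count as zero."""
    out = {}
    for assigned in groups:
        out[assigned] = sum(
            prior * normal_pdf(x, mean, sd) * cost.get((true, assigned), 0.0)
            for true, (prior, mean, sd) in groups.items()
        )
    return out

def bayes_assign(x, groups, cost):
    """Group with the smallest expected cost at x."""
    costs = expected_costs(x, groups, cost)
    return min(costs, key=costs.get)
```

With equal priors, unit standard deviations, means 0 and 2, and equal costs, the dividing point is the midpoint, 1. If placing a member of the second group in the first is made five times as costly as the reverse error, the dividing point moves toward the first group's mean, so that a score of 0.9 is then assigned to the second group. With all costs equal the rule is that of minimum probability of misclassification, as in Kossack's example.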

Two variants of the classificatory problem arise when: (a) the groups 
are not defined a priori but only after observations are made on a sample 
of undifferentiated individuals and the data warrant a separation into 
groups; and (b) there are predetermined quotas to be filled for each of 
several groups. A classical example of the first type of problem is Pear- 
son’s resolution of a mixed series into two normal components in the 
univariate case (71). More recently, and from quite a different standpoint, 
Paulson (70) treated the problem of separating individuals into a “su- 
perior” and an “inferior” group, if the data so warrant, and concluded 
that all individuals are “neutral” otherwise. The second variant, which 
may be termed problems of allocation, was exemplified by Rao (79). Its 
solution, at least for the two-group case, is independent of a priori prob- 
abilities and of differential costs, even tho these enter into the formula- 
tion of the problem. The decision rule is simply to fill the quota, say N₁, 
for the first group by allocating to it all the individuals for whom the like- 
lihood-ratio (group 1 to group 2) has the N₁ largest values, and assigning 
the rest to the second group. 
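
The two-group allocation rule stated above amounts to a ranking of the likelihood ratios. A minimal sketch follows; it assumes that the ratios have already been computed for each individual, and the function name is a hypothetical one.

```python
def allocate(likelihood_ratios, quota):
    """Fill group 1 with the `quota` individuals having the largest
    likelihood ratios (group 1 to group 2); the rest go to group 2.
    Returns a list of group labels in the original order."""
    order = sorted(range(len(likelihood_ratios)),
                   key=lambda i: likelihood_ratios[i], reverse=True)
    first = set(order[:quota])
    return [1 if i in first else 2 for i in range(len(likelihood_ratios))]
```

Neither a priori probabilities nor costs appear as arguments, in keeping with the remark that the solution is independent of them once the quota is fixed.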


Discriminant Analysis: An Integration 


So far as two-group cases are concerned, the papers by Fisher (33), 
Welch (110), and others noted in the first section gave fairly exhaustive 
discussions of the relationships holding among the various multivariate 
statistical methods surveyed. Similar discussions, from somewhat different 
standpoints, were given by Hotelling (46), who used the concept of in- 
variants to unify the various statistics, and by Kullback (59), for whom a 
measure of “informational divergence” was the central concept. Tintner 
(102) also discussed these relationships from a formal mathematical stand- 
point, and Tyler (105) published an expository article with a computed 
example involving psychometric data, which illustrated the T²-test and 
a classificatory procedure based on Rao’s work. 

Bartlett (4, 6, 8) was one of the first authors to introduce vector and 
matrix ideas into the statistical literature; he has long been concerned with 
relationships among various multivariate technics without restriction as to 
number of groups. He placed the technics within the framework of a 
generalized regression theory, which had its origin in two papers by 
Hotelling (45, 47). The “criterion variable” in generalized regression 
theory, or canonical correlation theory as it is usually called, is not a 
single variable as in multiple regression, but a set of variables from which 
a particular linear combination is formed so that its correlation with an 
optimum linear combination of the predictor variables is a maximum. 
There are thus two sets of combining weights—one for the predictor 
variables and another for the criterion variables—to be obtained in a 
canonical correlation problem. Hotelling showed that these sets of weights 
are given by vector solutions of certain equations. Each value which satis- 
fies his equations gives the square of the correlation coefficient between 
the combination of predictor variables and that of criterion variables 
specified by the corresponding vector solutions. 
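
For orientation, the equations in question may be written in present-day matrix notation. The symbols are assumptions of this sketch and are not Hotelling's own: S_xx and S_yy denote the dispersion matrices of the predictor and criterion sets, S_xy = S_yx' their cross-dispersion matrix, a and b the two sets of combining weights, and R_c the correlation between the two linear combinations.

```latex
\begin{aligned}
\left(S_{xy}\,S_{yy}^{-1}\,S_{yx} - \lambda\,S_{xx}\right)\mathbf{a} &= \mathbf{0},\\
\left(S_{yx}\,S_{xx}^{-1}\,S_{xy} - \lambda\,S_{yy}\right)\mathbf{b} &= \mathbf{0},
\end{aligned}
\qquad \lambda = R_c^{2}.
```

Each root λ common to the two equations is the square of a canonical correlation, and the vectors a and b belonging to that root are the corresponding weights for the predictor and criterion sets.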

When a set of dummy variables numbering one less than the number 
of groups is taken as the criterion set representing group-membership, 
the vector solutions of the first of Hotelling’s equations are proportional 
to those of the equation fundamental both to Rao’s canonical reduction 
procedure and to Bryan’s multiple discriminant functions. This fact is 
implied in Bartlett’s interpretation of discriminant analysis (4, 6), as 
well as Brown’s formulation of the discriminant problem (18), and has 
been given an explicit algebraic demonstration by Tatsuoka (92). More 
importantly, it was Bartlett’s integration that permitted the use of Wilks’s 
Λ criterion for disparity of centroids in the significance test in Rao’s 
canonical reduction of the D²-approach, noted earlier. The identity of 
Rao’s and Bryan’s equations means that the same test may be used for 
determination of the number of multiple discriminant functions to be 
retained as representing dimensions of significant intergroup variations, 
as was illustrated by Tiedeman and Bryan (96). In case the population 
centroids are all identical, the number thus determined will of course be 
zero. 

The formal equivalence of Rao’s and Bryan’s equations has the further 
consequence of relating Rao’s generalization of the D²-statistic with the 
discriminant criterion. But, whereas in Rao’s framework λ was introduced 
merely as a Lagrangian multiplier, in Bryan’s formulation it expresses 
the value of the discriminant criterion for the linear combination with 
weights given by the corresponding vector solution. Thus, the sum of the 
values of the discriminant criterion accruing to the successive discriminant 
functions is proportional to Rao’s measure of the over-all distance among 
k groups. 

Turning now to multi-group classification problems, Tatsuoka (93) has 
verified that the minimax solution based on multiple discriminant functions 
is equivalent to that based on the original variables, in the special case 
when all the population distributions are multivariate normal with equal 
dispersions. 

Thus, we see that discriminant analysis can be used as a unified approach 
in solving a research problem involving multivariate comparison of several 
groups, which is likely to have as its three phases, (a) the establishment 
of significant group-differences, (b) the study and “explanation” of these 
differences, and finally (c) the utilization of multivariate information 
from the samples studied in classifying a future individual known to belong 
to one of the groups represented. Of course, if the objective is either 
(a) or (c) alone, direct methods are available which do not require the 
computation of multiple discriminant functions. For objective (a), 
Hotelling’s T²-test or the Wilks-Bartlett Λ-test is appropriate; for objective 
(c), the procedure based on the Neyman-Pearson likelihood-ratio principle, 
as developed by a number of authors and extended and systematized by 
Rao, is used. It is when the basic-research purpose (b) is included among 
the researcher’s objectives that discriminant analysis, or, alternatively, 
Rao’s canonical reduction of the D²-approach, offers a distinctively new 
contribution and at the same time provides statistics for studying objectives 
(a) and (b) without independent attack. The two methods are differenti- 
ated only by the initial orientation and the computational procedure, de- 
vised respectively by Rao and by Bryan. These procedures for solving the 
same basic equation yield respectively the canonical variates or the multiple 
discriminant functions. In Rao’s procedure the inverting of the matrix W is 
avoided by transforming the original set of p variables into a new set of p 
uncorrelated variables, which have an easily inverted diagonal matrix. 
The latent roots and vectors are then obtained by an iterative method 
developed by Hotelling (48). Bryan’s procedure does not dispense with 
the inverting of W, but gets, by a series of transformations, a (k-1, k-1) 
matrix whose latent roots are identical with those of W⁻¹A, and from whose 
latent vectors those of W⁻¹A can readily be derived. The method for getting 
the latent roots and vectors of the reduced matrix is exact. Bryan’s pro- 
cedure recommends itself especially in cases where the number of groups is 
much smaller than the number of variables. 
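
The quantity common to both procedures is the set of latent roots of W⁻¹A. The sketch below obtains them directly for two variables, taking W as the pooled within-groups and A as the among-groups matrix of sums of squares and cross products. It inverts W outright and solves the 2 x 2 characteristic equation. It therefore reproduces neither Rao's nor Bryan's computational routine, and the function names are hypothetical.

```python
import math

def sscp_matrices(groups):
    """Within-groups (W) and among-groups (A) sums of squares and cross
    products for two variables. groups: list of lists of (x, y) pairs."""
    n = sum(len(g) for g in groups)
    grand = [sum(obs[j] for g in groups for obs in g) / n for j in range(2)]
    W = [[0.0, 0.0], [0.0, 0.0]]
    A = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        mean = [sum(obs[j] for obs in g) / len(g) for j in range(2)]
        for i in range(2):
            for j in range(2):
                W[i][j] += sum((o[i] - mean[i]) * (o[j] - mean[j]) for o in g)
                A[i][j] += len(g) * (mean[i] - grand[i]) * (mean[j] - grand[j])
    return W, A

def latent_roots(W, A):
    """Latent roots of W^-1 A for the 2 x 2 case, largest first."""
    det_w = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    inv = [[W[1][1] / det_w, -W[0][1] / det_w],
           [-W[1][0] / det_w, W[0][0] / det_w]]
    M = [[sum(inv[i][k] * A[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    # Roots of t^2 - trace * t + det = 0; guard against rounding below zero.
    disc = math.sqrt(max(trace * trace - 4 * det, 0.0))
    return (trace + disc) / 2, (trace - disc) / 2
```

For three groups whose sample centroids lie on one straight line, the second root is zero and a single discriminant function carries all of the intergroup variation. When the centroids are not collinear, both roots are positive.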

Harris (39) recently reviewed Rao’s method of obtaining a set of 
uncorrelated variables and demonstrated the interesting fact that this set 
of variables is related to the ultimately desired canonical variates by an 
orthogonal transformation. He suggested that, for the purpose of describing 
the k groups in the p-space, Rao’s uncorrelated variables are as satisfactory 
as the canonical variates. This is true if “description” means a report of the 
distances separating pairs of groups; it is not true if “description” also 
includes indicating the smallest subspace occupied by the group-centroids. 


Applications of Discriminant Analysis 


Despite the extensive theoretical developments noted here, discriminant 
analysis is virtually unused in education and psychology. About as many 
applications of discriminant analysis have been made for the purpose of 
illustrating the method (20, 49, 65, 96, 97, 98, 103) as for using the method 
as a research tool with little explanation of it (2, 12, 22, 30, 51, 58, 89, 109). 
For some psychologists and educators the discriminant function is still little 
more than a superfluous way to compute a multiple regression function 
for a dichotomous criterion (35, 38, 67, 112). The discriminant function 
seems to be accepted in its own right in the field of biometrics (9, 10, 11, 
16), but to be still in the expository phase in the field of econometrics (101). 

Despite these facts, the discriminant function is presently discussed 
in three texts on mathematical statistics (41, 55, 56), three texts on 
biometrics (32, 64, 74), and two texts on psychometrics (53, 111). Rao’s 
book (74) merits intensive study by anyone interested in discriminant 
analysis. Rulon and others (88) recently completed a detailed monograph 
on the use of multivariate methods in profile analysis in which the multiple 
discriminant function is also considered extensively. 

Several reviews of discriminant analysis have appeared recently (23, 
40, 61, 65, 95). Persons interested in the use of discriminant analysis 
for classification purposes would do well to study the comprehensive 
review by Hodges (40). Lubin (61) emphasized maximization of the 
multiple correlation ratio for predicting qualitative attributes from quanti- 
tative data. Moonan’s article (65) highlighted Rao’s approach to dis- 
criminant analysis, while Tiedeman’s review (95) emphasized the direc- 
tional purpose of the discriminant function. 

In two recent reviews, Cronbach (25, 26) considered the use of the dis- 
criminant function in the study of psychometric profiles, but held the linear 
model inappropriate for psychological data such as those obtained from 
the Rorschach Test. Brown (17) considered discriminators for the purpose 
of classification in clinical psychology and discussed a quadratic form 
of the discriminant function. 
CHAPTER V 


Factor Analysis¹ 


HERBERT SOLOMON and BENJAMIN ROSNER 


The last report on factor analysis in the REVIEW OF EDUCATIONAL 
RESEARCH (26) covered the period 1939-1951. Altho the period covered 
by the present survey is much shorter, the number of papers reviewed is 
nearly as great as that for the previous period. This accelerated research 
activity emphasizes the present interest in the state of the art. Examination 
of the literature reveals that most of the research is on the mechanics of fac- 
tor analysis as a research technic or its application in some subject area. 
However, the new research does not overlook the origins of factor analysis, 
for it also includes some attempts at evaluating the validity of factor 
analysis as a mathematical model to describe observed human behavioral 
responses in terms of operationally meaningful human parameters. 


Validity of Factor Analysis 


Factor analysis, as we know it today, emerged as a research technic 
thru the creation of a mathematical model which attempted to portray 
the relationships between a response to a test of mental ability and the 
human factors which produced it. In view of recent emphasis on model 
building in the behavioral sciences it is interesting to note that the factor 
analysts were already at work in model building some 50 years ago. 
The model builders of today have the same goal as the original factor 
analysts; namely, the creation of a model which both reproduces and 
explains observed data. Altho there are many factor models which reproduce 
observed data (and some which do not), it remains an open question as 
to which models satisfy both conditions. 


New Models 


The term radex theory is used by Guttman (84) to describe a new model 
which attempts both to reproduce and to explain a correlation matrix 
developed from test scores of mental ability. Two distinct notions are 
involved in a radex. One is that of a difference in kind between tests, 
and the other is that of a difference in degree. Each of these notions gives 
rise to a simple order system; for instance, for all tests of the same kind, 
say numerical ability, there will be differences in the degree of their com- 
plexity; such a set of test variables of the same kind is called a simplex. 
Correspondingly, all tests of the same degree of complexity differ among 
themselves only in the kind of ability they define. Since here a law of order 
from “least” to “most” makes no sense, a circular order is defined for test 
variables obeying such a law and the set of test variables is called a 
circumplex. The radex approach is essentially one of order-factors as con- 
trasted with the previous common-factors approaches of Spearman, Thur- 
stone, and others. Aside from Thomson’s “sampling of the bonds” model, 
the radex appears to be the first different model since Spearman’s two-factor 
theory of 50 years ago. Several sets of numerical data previously factored 
by other methods were analyzed successfully in terms of radex theory (84). 
Gabriel (65) demonstrated the simplex structure of the Progressive Matrices 
Test and the practical applications of simplex theory with regard to predic- 
tion or screening. 


Use and Abuse 


In a humorous paper about factor analysts, McNemar (108) highlighted 
some of the important problems and listed 10 pitfalls for factor analysts 
in studying human behavior. Muhsam (113), in a somewhat similar vein, 
compared factor analysts to Plato’s cave dwellers. Accordingly, he factored 
seven measures on hens’ eggs to produce three factors which he said did 
not seem to “reflect faithfully the actual characteristics of an egg.” In brief, 
a correlation matrix was reproduced but not necessarily explained. Eysenck 
(51) discussed the role of factor analysis for both the creation and valida- 
tion of hypotheses and contended that the misuse of factor analysis stemmed 
from the incorrect substitution of one goal for another. In another article 
(47), he discussed the logical basis of factor analysis on three levels: de- 
scribing data, suggesting hypotheses, and supporting hypotheses. Guilford 
(78) called attention to 10 common faults in factor analysis and stressed 
the importance of initial planning and design. An exposition of the rationale 
of factor analysis from its beginnings to the present was given by Peel 
(118) and by Vincent (146). Edwards and Horst (46) suggested that 
controlling the social desirability of the items in Q-technic studies would 
lead to clearer interpretations of the factors. Problems of recognizing and 
interpreting composite factors were considered by French (58) and 
Zimmerman (163). 


Methods of Factor Analysis 


Many articles on various aspects of methodology relate to (a) compari- 
son of methods, (b) new twists on old methods, (c) efficiency of comput- 
ing, and (d) statistical developments arising out of the complex sampling 
distribution theory one has come to expect in factor analysis. 


Comparisons 


Factorial methods have been compared by a number of investigators. 
Moursy (112) discussed three types of factorial organization: general and 
group factors, general factors only, and group factors only. He then ana- 
lyzed a matrix of 20 tests of a single activity by all major methods and con- 
cluded that Burt’s hierarchical structure was best. Howie (96) compared 
Burt’s group-factor method and Thurstone’s centroid procedure by factoring 
36 variables. He found essentially the same factors with but minor diver- 
gencies. In a paper on multiple-group methods for common-factor analysis, 
Guttman (83) reviewed early work on the subject, presented an outline 
of computational technics, and provided algebraic interpretations of the 
common factors as a linear combination of weighted test scores. Stephenson 
(139) presented the differences between the R and Q technics of factor 
analysis. Cattell (27, 30) described and compared in some detail the 
O, P, Q, and R technics. 

Guttman’s (82) image theory presents a multiple-correlation, as con- 
trasted with a partial-correlation, approach to the notion of “commonness” 
in a set of variables. He showed that this approach permits interpreting 
or explaining the intercorrelations of any set of variables by means of their 
mutual multiple regressions and that common-factor theory in the Spear- 
man-Thurstone sense is a special type of image theory. Extending this line 
of thought, Guttman (86) considered conditions that are necessary for 
common-factor analysis and concluded that a parsimonious common-factor 
system may not exist for a given universe of content or variables. Since 
the common-factor approach begins with the assumption that a parsimo- 
nious system exists, this paper raises an important question about the 
validity of the model. 

Considering the analysis of qualitative data, Burt (20) discussed the 
problems of scale analysis and advocated the use of factor analysis for 
qualitative as well as quantitative data. He compared the results of factor 
analysis with Guttman’s scale analysis technics. Guttman (85) discussed 
various types of methodology employed in the analysis of qualitative data 
and arrived at conclusions similar to those reached by Burt. Green (71) 
pointed out basic similarities between latent-structure analysis and factor 
analysis. Torgerson and Green (143) reported on a factor analysis of 
subject-matter experts to discover the dimensionality of opinion related to 
the intrinsic validity of achievement tests. 


Operations on the Correlation Matrix and the Factor Matrix 


Within the framework of the Thurstone approach to factor analysis there 
is always interest in rotation methods, transformation of matrices, and the 
concept of simple structure. Adcock (1) described the use of a symmetric 
transformation matrix to yield orthogonal reference axes directly without 
resort to rotational procedures. Green (72) developed three analytic meth- 
ods for obtaining an orthogonal factor matrix which closely resembles a 
given oblique factor matrix. The same problem was also attacked by Gibson 
(68) whose procedure approximates two of the solutions offered by Green. 
Carroll (25) presented an iterative method for approaching simple structure 
which avoids subjective decisions and yields a unique solution. However, 
the computational routine is extremely laborious. Renshaw (125) demon- 
strated that rotations must be guided by the positions of the test points 
relative to the reference axes. Thurstone (142) presented a single-plane 
method for rotation to simple structure without using graphic devices. 
Burt (21) treated the “law of progressive sign reversals” as a logical conse- 
quence of the structural relations obtaining among factors and decided 
that the bipolar factor matrix should be so arranged that a minimal number 
of sign changes is manifest in each column. Sandler (134) presented a 
technic for facilitating the rotation of factor axes based on an equivalence 
between persons and tests. A method for estimating factor loadings without 
computing the correlation matrix was developed by Wherry and Winer 
(148). 


Statistical Developments 


Problems of statistical inference arising out of factor analysis are of 
necessity quite complicated. The multivariate framework and consequent 
sampling distribution problems often demand intractable mathematical 
analyses. Bartlett (11) made some general remarks on factor analysis in 
psychology as a statistician sees it. Burt (22) reviewed and criticized 
various significance tests in factor analysis and recommended Bartlett’s 
test for deciding when to stop factoring. Using the likelihood ratio criterion, 
Rippe (127) arrived at a large sample chi-square test to test the complete- 
ness of orthogonal and oblique factor solutions and to determine the sig- 
nificance of common-factor loadings. Slater (138) demonstrated that a 
matrix with negative correlations can be factored without transformation. 
He offered a test of significance for the residual correlations under some 
restrictions on the latent roots of the matrix. Lawley (100) discussed a 
modified method of estimation in factor analysis and some large sample 
results. Wold (158) presented some experiments in which artificial statis- 
tical data were constructed in accordance with a specified factor structure 
and then the factor structure estimated from the artificial data. Gourlay 
(69) illustrated the difficulties arising in the use of tetrachoric correlations 
in factor analysis. Ahmavaara (3) presented a mathematical theory of the 
invariance of a factor structure under selection. Kestelman (97) demon- 
strated that even when factors are more numerous than tests, factor loadings 
may be obtained which are in standard form and uncorrelated. Spearman’s 
earliest formulas used in factor analysis were reviewed by Vincent (146). 
Harris (91) reported on the relation of factors derived from raw scores to 
factors derived from deviation scores. He also presented evidence for Burt’s 
reciprocity principle. 


Computational Methods 


In recent years attention has been given to the mechanical operations 
involved in factor analysis. Harman (89) presented the “square root” 
method for the solution of a set of simultaneous linear equations and 
demonstrated its application in computations for multiple-group analysis. 
A short-cut method for the computation of inverse matrices was reported by 
Andree (4). Wilson and Comrey (150) modified Thurstone’s diagonal 
method and found good agreement between the modified diagonal and 
complete centroid methods. 
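
The “square root” method is taken here to be the procedure now commonly called Cholesky factorization. That identification, the function names, and the layout below are assumptions of this sketch, and Harman's own computing scheme is not reproduced. For a symmetric positive-definite matrix of coefficients, the matrix is first factored into a triangular matrix and its transpose, and the equations are then solved by two triangular substitutions.

```python
import math

def square_root_factor(a):
    """Lower-triangular L with L L' = a, for symmetric positive-definite a."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            # Diagonal entries are square roots; the rest are quotients.
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve(a, b):
    """Solve a x = b by forward, then back, substitution on the factor."""
    n = len(a)
    L = square_root_factor(a)
    y = [0.0] * n
    for i in range(n):                      # forward: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):            # back: L' x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
```

The factor is computed once and may be reused for any number of right-hand sides, which is one reason the method suits the repeated solutions required in multiple-group analysis.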

The trend toward the analysis of higher order matrices has emphasized 
the need for high-speed electronic digital computers. Wrigley and Neuhaus 
(159) described the operations of the Ordvac electronic digital computer 
in the factorization of a matrix. Harman and Harper (90) reported on an 
electronic psychological factor rotator to facilitate the rotation of axes. 


Applications 


It is now half a century since Spearman introduced his single-factor 
model to explain the intercorrelations in a battery of mental tests. Since 
that time factor analysis has emerged as one of the most popular research 
tools in the behavioral sciences. In recent years it has also been applied 
to many variables other than psychological tests. Indeed the diversity of 
its application during the past three years is clearly illustrated by Heath’s 
analysis (92) of pattern-making for the garment industry, Adkins’ attempt 
(2) to redefine the structure of the American Psychological Association, 
Twedt’s analysis (144) of advertising readership, and Muhsam’s (113) 
theoretical and humorous discussion of factor analysis as applied to the 
physical dimensions of a hen’s egg. In psychology and in education, inves- 
tigators have continued to identify and delineate the primary mental and 
emotional characteristics, e.g., Cattell (28) and Zimmerman (164), and to 
organize observable behavior into descriptive and diagnostic classifications, 
e.g., Ryans (130) and Wittenborn (153). The foregoing overview of re- 
cent trends in the application of factor analysis should serve to orient the 
reader to the more detailed exposition of the following sections. 


Test Construction 


Zimmerman (162) showed that factor loadings vary with increasing item 
difficulties in a spatial-visualization test. The easier test had a high loading 
on perceptual speed, the test of medium difficulty on the space factor, and 
the most difficult test on visualization. Differences in the factor content of 
right and wrong scores were revealed by Fruchter (61), and formulas were 
derived to maximize the purity of the measures. The score on a test per- 
formed under various speeds was demonstrated by Myers (114) to be a 
function of two orthogonal factors, the ability to answer correctly and the 
ability to answer quickly. Gaier, Lee, and McQuitty (66) found five 
factors for response patterns in a test of logical inference. Wittenborn and 
others (157) compared factor loadings of 20 symptoms rated by two raters 
to show that symptom clusters remain stable despite rater differences. 
Zachert and Friedman (160) showed that the factorial content of the Air- 
crew Classification Batteries remains similar under wartime and peacetime 
conditions. 


Identification of Primary Abilities 


In order to improve their measures of previously identified factors, Guil- 
ford, Fruchter, and Zimmerman (80) factored a battery of 39 experimental 
and seven “reference” tests to yield 13 factors. An attempt to duplicate a 
psychomotor test by analogy in printed form failed completely. Fruchter 
(63) extracted eight previously identified factors from a battery composed 
in part of the AGCT, ACTB, Gray-Votaw General Achievement Test, and 
the DAT. A revised rotational solution of Thurstone’s PMA Test Battery by 
Zimmerman (164) verified six original factors and introduced four new 
ones. 


Space Factor 


The nature of the space factor has aroused considerable discussion and 
experimentation. Zimmerman (161) reviewed the hypotheses explaining 
the space factor and emphasized the hypothesis of a continuum ranging 
from perceptual speed to intellectual visualization. Fruchter (62) reviewed 
the literature on spatial ability and cited problem areas and sources of dif- 
ficulty in applying factor analytic methods. Michael (111), on the other 
hand, described a four-stage experimental program to identify the psycho- 
logical processes associated with the spatial-visualization continuum. 

The most thoro study of tests in the perceptual area was contributed 
by Roff (128). Seventy tests of mental and psychomotor behavior, plus 
biographical information, were factored to determine the pattern of inter- 
relations among perceptual tests and other factors. A family of eight 
perceptual factors was extracted. Bair (7) factored 17 clerical aptitude 
measures and one general intelligence test to yield three factors: (a) per- 
ceptual analysis, (b) speed, and (c) comprehension of verbal relationships. 
Motion picture tests of perceptual functions and a battery of paper and 
pencil tests were analyzed by Fruchter and Mahan (64) to reveal three addi- 
tional factors: (a) pattern perception, (b) movement detection, and (c) 
division of attention. The perceptual speed-intellectualized visualization 
factor was isolated, furthermore, in studies of reasoning (110) and mathe- 
matical ability (9). 


Reasoning Factor 


During the past three years investigators have attempted to isolate and 
define more precisely the abilities in the realm of reasoning. Guilford and 
others (79) applied Tucker’s adaptation of Hotelling’s iterative procedure 
for determining principal components to a battery of 54 tests designed to 
study reasoning abilities. Four reasoning factors, among others, were ob- 
tained, and comparisons were made with other similar analyses. Green and 
others (73) extracted 12 factors from a battery of 32 mental tests of which 
seven related to reasoning ability. These factors may be grouped to define 
the deductive and inductive reasoning domains. A second-order analysis of 
reasoning abilities by Matin and Adkins (110) revealed six factors, five of 
which were tentatively interpreted as: (a) precision in formation and use 
of verbal concepts, (b) general verbal fluency, (c) visualizing spatial 
constancy during movement, (d) speed in analysis, and (e) flexibility in 
analysis. Botzum (16) and Pemberton (120) studied the closure factor 
in relation to reasoning and other cognitive processes, and Corter (40) 
extracted seven reasoning abilities, noting among others previously cited, 
an “academic” factor. 
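
Hotelling's iterative procedure is, in essence, repeated multiplication of a trial vector by the correlation matrix until the vector stabilizes. The following is a minimal sketch of that basic iteration only; it is not Tucker's adaptation, whose details are not given here. The correlation values and the function name are hypothetical.

```python
import math

def first_principal_component(r, iterations=200, tol=1e-12):
    """Power iteration on a symmetric matrix r (a list of lists).

    Returns (eigenvalue, loadings), where the loadings are the unit
    eigenvector scaled by the square root of the eigenvalue.
    """
    n = len(r)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary starting vector
    eigenvalue = 0.0
    for _ in range(iterations):
        # Multiply the trial vector by the matrix, then renormalize.
        w = [sum(r[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
        # Rayleigh quotient of the new vector estimates the largest root.
        new_eigenvalue = sum(
            w[i] * sum(r[i][j] * w[j] for j in range(n)) for i in range(n)
        )
        converged = abs(new_eigenvalue - eigenvalue) < tol
        v, eigenvalue = w, new_eigenvalue
        if converged:
            break
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    return eigenvalue, loadings

# Hypothetical correlations among three tests (illustrative values only).
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
eigenvalue, loadings = first_principal_component(R)
```

Later components would be obtained by removing the first component from the matrix and repeating the iteration on the residual.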


Psychomotor Abilities 


Fleishman (55) contributed much to the study of psychomotor abilities; 
he suggested that different factors were involved at different stages in the 
performance of psychomotor tests. In a later article (56), he presented a 
review of factor-analytic studies in the area of psychomotor performance, 
discussing nine motor and three nonmotor factors. Rimoldi (126) investi- 
gated the speed factor and analyzed 59 tests of physical and mental speed 
which yielded nine first-order and four second-order factors. No general 
speed factor was found. There was no relationship between motor and 
cognitive speed, but some between gross and fine body movements. 


Other Primary Abilities 


Knoell and Harris (98) studied word-fluency, revealing eight factors. 
Two were interpreted as (a) speed of fluency and (b) versatility, while six 
doublets were left undefined. 

Wilson, Guilford, and Christensen (151) presented the results of three 
approaches to the measurement of originality. “Cleverness” type tests 
yielded the highest loadings on the originality factor, but were deemed 
too costly in terms of time and energy to determine scores. 

Barakat (9) extracted four factors from a battery of tests designed to 
measure mathematical and related abilities: (a) a general intelligence 
factor—“g,” (b) mathematical ability, (c) verbal ability, and (d) visuo- 
spatial ability. He concluded that the number factor was by no means 
unitary or innate. 

A rather novel application of factor analysis was Gunn’s (81) analysis 
of the qualities of poetry writing. He found two factors, a general esthetic 
factor and a technical factor, suggesting that the appreciation of poetry 
was similar to that of music and art. Somewhat related is Osgood’s analysis 
of semantic description and meaning (117). Preliminary results were re- 
ported for “evaluative,” “strength,” and “activity” factors. 

Concerned with the grading of athletes, Highmore and Taylor (93) 
factored eight tests of athletic ability to yield four factors: (a) general 
athletic ability, (b) running, (c) throwing, and (d) jumping. They sug- 
gested that an adequate measure of all-round athletic ability might be ob- 
tained by three tests: sprinting, football throw-in, and high jump. 


The Differentiation Hypothesis 


A number of investigators have continued to study the effects of age on 
the general factor, “g.” Burt (19) reviewed the literature and concluded 
that as age increases, the general factor accounts for a distinctly smaller 
proportion of the total amount of variation, whereas group factors play an 
increasingly predominant role. Peel and Graham (119), however, analyzed 
the subtests of four performance tests and found no evidence of differentia- 
tion between the ages of 9 and 10½. An analysis of the subtests of the 
WISC by Hagen (87) revealed that the general factor maintained its rela- 
tive importance between the ages of 5 and 13. 


Mental and Personality Assessment 


An analysis of the Wechsler-Bellevue subtest scores of mental patients 
was reported by Cohen (39) to yield three correlated factors: (a) verbal 
organization, (b) nonverbal organization, and (c) freedom from distrac- 
tability. Later Cohen (38) described the subtests of the W-B in terms of 
their known factor content. Birren (13), working with elderly subjects, 
reported that the W-B measures what they have learned, rather than what 
they can learn. 

Williams and Lawrence (149) reported on the factorization of the 
MMPI, Rorschach, and W-B data. Earlier, Lotsof (107) had demonstrated 
that four factors, (a) verbal intelligence, (b) elaboration, (c) productivity, 
and (d) individuality, were of utmost importance in interpreting Ror- 
schach categories. 

A study by Wheeler, Little, and Lehner (147) suggested that the MMPI 
subtests could differentiate psychotics from neurotics, but could not dif- 
ferentiate the type of neurosis. Tyler (145) presented additional MMPI 
factors for a sample of female graduate students. 

Banks and Keir (8) factored 32 “most diagnostic” items of the Bern- 
reuter Personality Inventory and extracted three factors: (a) general 
“nervousness,” (b) introversion, and (c) dominance-self-sufficiency. Brog- 
den (18) reported 10 factors in the Allport-Vernon Test, A Study of 
Values; and Lorr and Murney (105) extracted two from the Hildreth Feel- 
ing and Attitude Scales. Raino (124) reported a factor analysis of Szondi 
choices and related two factors to behavioral criteria. Five factors were 
extracted by Guertin (74) from the Bender-Gestalt records of mental 
patients. A Norwegian interest schedule was factored by Braaten (17) to 
reveal four factors. An interesting analysis of OSS Situational Tests was 
reported by Sakoda (132). Cattell (31) presented the theory and construc- 
tion of the 12-factor I.P.A.T. Junior Personality Quiz. Ford and Tyler 
(57) analyzed the Terman-Miles M-F Test and indicated two dimensions to 
sexuality: emotionality and interest. 

Eysenck (48) stressed the use of factor analysis for relating experimental 
facts to a hierarchical model of personality. A forward step was contributed 
by Baehr (6) who factored three Guilford inventories to yield four second- 
order factors: (a) emotional stability, (b) primary function, (c) activity, 
and (d) emotional instability. These second-order factors were compared 
with the Heymans-Wiersma conceptual scheme of temperament traits. 

Cattell (28), Cattell and Cross (32), Cattell and Gruen (33), and Cattell 
and Horowitz (34) investigated the nature of various source traits and re- 
lated them to other personality variables. Eysenck (50) studied the schi- 
zothymia-cyclothymia dimension and concluded that there were no quali- 
tative differences in abnormal and normal behavior, but rather a continuum 
from rational to extreme psychotic. Takals (140) presented a factor 
analysis of visual and auditory memory stimuli and succeeded in dif- 
ferentiating visual and auditory type persons. An attempt to isolate factors 
descriptive of handwriting was reported by Lorr, Lepine, and Golder (104), 
in which length, breadth, and slant were revealed as important components. 
Barnes (10) factored 24 tests designed to tap the developmental levels of 
psychosexual behavior, and concluded that the oral, anal, and phallic char- 
acteristics cited by analysts did not form independent factors. 

Wittenborn (153), Wittenborn and Bailey (154), Wittenborn and Holz- 
berg (155), Wittenborn and Weiss (156), and Wittenborn and others 
(157) applied factor analysis to behavioral symptoms and found symptom 
clustering sufficiently stable for descriptive purposes. Research reported 
by Lorr, Rubinstein, and Jenkins (106) suggested that many common 
psychiatric syndromes could be identified factorially to provide more ef- 
ficient rating and classification schema. Degan (42), working independently, 
reported similar results, and O’Connor (115) identified eight psycho- 
neurotic reaction patterns by his analysis of behavioral symptoms and 
complaints. 

The area of rigidity has been investigated in a number of studies. Cattell 
and Winder (36) reviewed the concepts of structural rigidity and pro- 
posed two research designs for checking the number and nature of its 
factors. Pullen and Stagner (123) investigated the relationship between 
rigidity (tendency to persist in a previously made response) and shock 
therapy, and concluded that the shock tends to break up rigidity. A factori- 
zation of 16 tests of various rigidity components was reported by Scheier 
and Ferguson (135) who found no rigidity component common to all 
tests. 


Special Methods 


The use of a variety of factor analytic technics for the study of indi- 
viduals has been demonstrated by a number of writers. Cattell (27, 29) 
suggested the use of the P-technic as a method of quantitative psycho- 
analysis. Burt and Watson (24) applied the Q-technic to give a picture of 
changes in a person over a span of years, and Block (14) demonstrated the 
O-technic. Guertin (75, 76) and Guertin and Zilaitis (77) used inverted 
factor analysis to study schizophrenics, whereas Geist (67) employed it to 
compare MMPI profiles with medical psychiatric diagnosis. Secord, Dukes, 
and Bevan (137) applied Tryon’s cluster analysis to photographs to de- 
termine the relationships between physiognomy and personality. Fiedler 
(52) used Q-technic to distinguish one school of therapy from another, and 
experts from non-experts. The factors revealed that the nature of the 
therapeutic relationship is one of expertness rather than school. 


Classroom Performance 


The factors associated with academic success have been studied by 
various writers. McQuary (109) reported on some relationships between 
nonintellectual characteristics and academic achievement, suggesting, fur- 
thermore, that factor analysis was useful in studying sociological and an- 
thropological data. A factorization of course grades at the U. S. Coast 
Guard Academy, reported by French and others (59), indicated that 
mathematical, verbal, reading, and spatial abilities were important for 
academic success. Doyle (44) and Driscoll (45) analyzed standard intelli- 
gence and achievement test data to reveal (a) cognitive, (b) verbal, and 
(c) numerical factors. 

Bendig (12), Coffman (37), Hampton (88), and Ryans (130), and 
Ryans and Wandt (131) factored ratings of teacher-behavior to determine 
criteria of teacher-effectiveness. Coffman, furthermore, suggested that factor 
analysis be used to test for changes in students’ values as a result of teacher 
efforts. An inverted analysis of Cattell’s 16 PF Test made on 32 teachers 
was reported by Lamke (99) to reveal no clean-cut good or bad teachers, 
but to indicate several patterns for good as well as bad teachers. 

Other studies relating to classroom behavior were reported by Ryans 
(129) on teachers’ educational viewpoints, and by Ash and Hobaugh (5) 
on some ratable characteristics of instructional films. 


Attitudes 


A second-order analysis of the Fels Parent Behavior Scales by Lorr and 
Jenkins (103) indicated that parent behavior, as measured, could be de- 
scribed by three factors: (a) dependence-encouraging, (b) democracy of 
child training, (c) organization and effectiveness of home control. 

Tryon’s cluster analysis was applied by Pope (122) to 25 “guess-who” 
items in order to study children’s values. He concluded that different value 
clusters appear at low and high socio-economic levels for both boys and 
girls. 

The intercorrelations among 24 conditions reported for 273 maladjusted 
children were factored by Burt and Howard (23). No general factor was 
found, but two environmental and personal deficiency factors were ex- 
tracted. The authors concluded that cases of maladjustment hardly form a 
single homogeneous group. In a similar study, Lorr and Jenkins (102) 
reported that two second-order factors, social aggression and maladapta- 
tion, accounted for the patterns of poor adjustment among children. 

Much attention has been given lately to the isolation and understanding 
of the dimensions of social attitude and prejudice. Hofstaetter (94) fac- 
tored 19 items designed to identify prejudices and extracted five factors: 
(a) anti-negroism, (b) anti-semitism, (c) national pride, (d) puritanism, 
and (e) state socialism. He concluded that there was no justification for 
combining these factors into one type, the “authoritarian personality,” for 
example. O’Neil and Levinson (116), continuing their studies of authori- 
tarianism and religion, factor analyzed 32 items from the F, E, R, and T 
scales of the T.A.P. Social Attitude Battery, and extracted four factors de- 
scribing religious ideology and authoritarianism: (a) religious conven- 
tionalism, (b) authoritarian submission, (c) a masculine strength facade, 
and (d) a factor tentatively named “moralistic control.” Sanai (133) 
analyzed a questionnaire of 30 statements on social, political, and religious 
topics to yield four additional factors. 

Eysenck (49) extracted two social attitudes from an inventory ad- 
ministered to 263 Germans and compared them with factors found in 
English, Swedish, and American populations. He concluded that the struc- 
ture of attitudes in these four countries is very similar. A study of sex dif- 
ferences in attitude organization by Diggory (43) indicated that differ- 
ences in attitude organization parallel role differences and are not restricted 
merely to sex differences. Lorr (101) succeeded in isolating two major 
U. S. social attitudes: (a) political isolationism and economic conserva- 
tism vs. internationalism and economic socialism, and (b) individual free- 
dom from social control vs. control by government and social codes. A study 
by Bock and Husain (15) demonstrated the value of applying factor 
analysis to sociometric data. 


Industrial and Business 


An analysis of a point rating job evaluation plan covering clerical workers 
was reported by Grant (70) to reveal a lack of independence among the 
variables and low correlations among ratings and job worth. Similarly, 
Howard and Schutz (95) reported that a single factor, “skill demands,” 
accounted for 99 percent of the variance in job level. 

Schreiber, Smith, and Harrel (136) analyzed the responses to a 34-item 
questionnaire of 379 nonsupervisory, nonacademic employees at the Uni- 
versity of Illinois to reveal two clearly defined dimensions of employee 
attitudes: (a) job satisfaction, and (b) knowledge of employee benefits. 
On the other hand, Fleishman (54) applied the Wherry-Gaylord Iterative 
Factor Analysis to 150 items designed to measure supervisory behavior and 


REVIEW OF EDUCATIONAL RESEARCH Vol. XXIV, No. 5 





extracted two independent leadership dimensions: (a) consideration, and 
(b) initiating structure. Wilson and others (152) factored the responses 
of 98 skilled tradesmen to a 25-item questionnaire relating to supervisory 
behavior regarding employees. Four factors were extracted: (a) supervisor- 
subordinate rapport, (b) congenial work group, (c) informal leadership, 
and (d) group unity. 

Petrie and Powell (121) reported on the selection of nurses in England. 
A factor analysis of 18 ratings on 126 nurses revealed three factors: (a) 
general nursing ability, (b) intellectual capacity, and (c) personal rela- 
tionships. In a factor analysis of an interest test, Fitzpatrick and Wiseman 
(53) reported technical performance scores to be clearly distinguishable 
from intelligence and achievement scores, but similar to technical scores 
derived from a school report on technical ability, interest, and ambitions. 

Fruchter (60) presented an attempt to classify occupations on the basis 
of aptitude profiles required of trainees. The intercorrelations of aptitude 
and criteria scores were analyzed to determine the factor content of the 
criteria for purposes of job classification. 

Thomas (141) used a modified procedure of Tryon’s cluster analysis 
to identify eight clusters of office operations which may aid in the selection 
of predictor variables to determine job success. The studies by Adkins (2) 
on the structure of the APA, by Heath (92) on pattern-making, and by Twedt 
(144) on advertising readership have been alluded to earlier. 


CHAPTER VI 


Applications of Variance-Covariance Designs in Educational Research 


LEONARD S. KOGAN 


This nonexhaustive survey of applications of variance-covariance designs 
will pay scant attention to basic principles of experimental design or the re- 
sults of studies cited. Its simple purpose is to direct the reader to recent 
research that employs a variety of designs; the suggestions these studies 
offer may be of help to him in planning his own research. Several excellent 
texts (31, 50, 52) and general articles (18, 29, 37, 46, 47) have been 
added to the methodological literature; these should be consulted for ra- 
tionale and computational details. 


Variance Designs 


Single-Classification Variance Designs 


The most commonly used single-classification design is the two-group 
case, traditionally handled by the t-test. Because of its familiarity no ap- 
plications of the two-group case need be cited. When more than two groups 
are involved, the means may be compared with an over-all F-test. Jones 
(26) used a single-classification design in studying the language handicap 
of Welsh speaking children and found significant differences among five 
schools in reading quotients but not in verbal or nonverbal intelligence 
quotients. Towner and Galloway (48), in determining whether medical 
students recognized previously used items in a Cancer Knowledge Test, 
compared results for items classified into four groups depending on pre- 
vious usage; the analysis might have been strengthened by taking into ac- 
count interstudent differences in extent of recognition. 
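
The over-all F-test for a single-classification design compares the mean square between groups with the mean square within groups. The sketch below shows the arithmetic on three small hypothetical groups; the scores and the function name are illustrative assumptions and do not come from any study cited.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a single-classification design."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means about the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Within-groups sum of squares: scores about their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within

# Hypothetical scores for three groups (illustrative only).
f_ratio, df_between, df_within = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The resulting ratio is referred to the F distribution with the two returned degrees of freedom.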

An extension of the single-classification design, involving a “split- 
plot” principle, was used by Anderson (4) in a study of the relationship 
between teacher morale and student achievement. Analysis of variance was 
carried out for a design involving a classification of teachers within schools 
and schools within achievement groups. 


Double-Classification Variance Designs 


The precision and scope of studies can often be enhanced thru the use 
of randomized blocks and factorial design. In educational research it has 
become almost routine to form blocks on bases such as school, teacher, sex, 
IQ, socioeconomic level, or initial scores on the criterion variable. The 
purpose of such groupings is to increase the precision of experimental 
comparisons by deliberately arranging the design so that known sources 
of variability can be separated from both the experimental comparisons 
and the estimate of error. In applying factorial design, the investigator 
is interested in not only the main effects of several experimental factors 
but also their interactions. The importance of considering factorial design 
is pointed up by a large-scale study by Guetzkow, Kelly, and McKeachie 
(23) of the relative effectiveness of recitation, discussion, and tutorial 
methods in college teaching which was carried out as a single classification 
design with matched groups. Failing to find significant differences among 
the methods, the authors noted that the classroom situation involves so 
many variables and their interactions that attempts to establish the relative 
merit of a general method of teaching are likely to prove inconclusive. 
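
To make concrete what a factorial design separates, the following sketch partitions the total sum of squares of a balanced two-factor layout into rows, columns, interaction, and error. It assumes equal numbers of observations in every cell; the scores and names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def two_way_sums_of_squares(cells):
    """cells[i][j] lists the observations for row level i, column level j.

    Assumes a balanced layout (same number of observations per cell).
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    cell = [[mean(cells[i][j]) for j in range(b)] for i in range(a)]
    row = [mean(cell[i]) for i in range(a)]
    col = [mean([cell[i][j] for i in range(a)]) for j in range(b)]
    grand = mean(row)
    ss_rows = b * n * sum((m - grand) ** 2 for m in row)
    ss_cols = a * n * sum((m - grand) ** 2 for m in col)
    # Interaction: what is left of each cell mean after both main effects.
    ss_interaction = n * sum(
        (cell[i][j] - row[i] - col[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    ss_error = sum(
        (x - cell[i][j]) ** 2
        for i in range(a) for j in range(b) for x in cells[i][j]
    )
    return {
        "rows": ss_rows,
        "columns": ss_cols,
        "interaction": ss_interaction,
        "error": ss_error,
    }

# Hypothetical 2 x 2 layout with two observations per cell.
ss = two_way_sums_of_squares([[[1, 3], [5, 7]], [[2, 4], [10, 12]]])
```

A single-classification analysis of the same four cells would lump the rows, columns, and interaction terms together, which is why interactions cannot be examined in such a design.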

Pathael, Fitzpatrick, and Bischof (39) used a 3 × 4 design in a study of 
retention associated with three teaching procedures in four classrooms. Re- 
sults of a final examination indicated that the method of using IBM answer 
sheets plus discussion of questions resulted in somewhat better performance 
than use of self-scoring tests or discussion of new questions. A 3 × 3 
design was employed by Loomer (32) in a study of the effect of learning 
experiences in art on drawing. One classification was based on a control 
group with classroom experience only, an experimental group who also 
witnessed a sound color film, and a second experimental group who in 
addition went on a field trip. The second classification derived from the 
assignment of three themes as subjects for the drawings. 

A novel use of a 2 × 6 design was made by Glidewell (21) to compare 
predictions from clinical data with actual behavior in a group situation. 
Predicted responses were classified into two categories with six observers 
rating the subjects’ responses to prepared statements made by a role-playing 
leader. 

Altho analysis of variance is most commonly used to compare means, it 
may, when certain assumptions are met, be used to estimate components of 
variance ascribable to the different factors being investigated. Lucow (33) 
estimated variance components for a 2 × 6 design in which textbook- 
centered chemistry teaching was compared with laboratory-centered teach- 
ing. To provide a rationale for dealing primarily with variances, the author 
asserted that the goal of the course was increase in variance from pretest to 
post-test, under the assumption that greater variance in a group indicates 
greater expression of individual differences. 
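
The estimation of variance components can be sketched for the simplest case, a balanced single classification under a random-effects model, where the expected mean square between groups is the within-group variance plus n times the between-group variance. This is an assumed simplification for illustration and is not the two-way analysis carried out in the study cited; the scores and function name are hypothetical.

```python
def variance_components(groups):
    """Balanced one-way random-effects estimates.

    Returns (between_component, within_component), using
    E[MS_between] = sigma_w^2 + n * sigma_b^2 and E[MS_within] = sigma_w^2.
    """
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    ) / (k * (n - 1))
    return (ms_between - ms_within) / n, ms_within

# Hypothetical scores for three groups of three (illustrative only).
between, within = variance_components([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

The between-group estimate can come out negative in small samples, in which case it is usually taken as evidence that the component is negligible.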


Triple-Classification Variance Designs 


Curr and Gourlay (15) used a 2 × 2 × 8 design in evaluating the effects 
of remedial reading. One dichotomous classification contrasted a remedial 
group and a control group matched for intelligence and achievement. The 
other dichotomous classification was based on selection of pupils by teachers 
or by testing. Eight schools participated. The observations on which the 
analysis of variance was carried out were the gains shown by the pupils 
on tests such as reading comprehension, arithmetic, and spelling. The au- 
thors claimed that in this study the analysis of variance of gains was 
nearly equal in precision to analysis of covariance. 

Mitzel and Rabinowitz (36) employed a 2 × 4 × 4 design to study verbal 
behavior by the teacher in the classroom. Two observers recorded and cate- 
gorized statements made by four elementary-school teachers during four 
visits. Analyses were carried out for a number of categories relating to 
social-emotional climate in the classroom. The dependent variate was ex- 
pressed as a percent of total number of statements, and transformation 
was made to angles before analysis. 
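
The transformation of percents to angles is ordinarily the arcsine of the square root of the proportion, which tends to stabilize the variance of percentages before analysis of variance. A minimal sketch follows, assuming that this is the transformation meant; the function name is illustrative.

```python
import math

def angular_transform(percent):
    """Convert a percent (0 to 100) to an angle in degrees:
    angle = arcsin(sqrt(percent / 100))."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must lie between 0 and 100")
    return math.degrees(math.asin(math.sqrt(percent / 100.0)))

# Percents of 0, 25, 50, and 100 map to about 0, 30, 45, and 90 degrees.
angles = [angular_transform(p) for p in (0, 25, 50, 100)]
```

The stretching is greatest near 0 and 100 percent, where the variance of an untransformed percentage is smallest.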

Russell and Bendig (43) used a 5 × 2 × 14 design to study the relation 
of student ratings of five instructors to course achievement with academic 
aptitude controlled. Two extreme achievement groups were formed, a “plus” 
group with obtained grades at least one-half standard deviation above 
grades predicted by the ACE and a “minus” group with grades at least 
one-half standard deviation below prediction. This study illustrates the 
inferential difficulties encountered when 14 different rating scales are 
thrown into a single analysis of variance with no adjustment to equalize 
means and variabilities. There is also the possibility that the investigators 
failed to take account of unequal frequencies in the subclasses of the design. 

Two additional studies illustrate useful designs not found in the educa- 
tional research literature surveyed. Kenny and Bijou (27) studied the re- 
lationship between richness of fantasy production and ambiguity of The- 
matic Apperception Test cards. Their experimental design is of interest 
because it involved two judges, three levels of ambiguity, and three ex- 
aminers assigned to independent groups and had to be analyzed by split- 
plot principles. The frequently useful device of “confounding” is still not 
too commonly found in research design but was utilized in simple form by 
Ross, Rupel, and Grant (42) in a 2 × 2 × 2 factorial study of the effects 
of impersonal distraction, personal heckling, and electric shock upon per- 
formance in a card sorting test. The difference between two examiners was 
deliberately confounded with the second order interaction. 
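
Confounding a nuisance difference with the highest interaction of a 2 × 2 × 2 design can be sketched as follows: code each factor as -1 or +1 and send a treatment combination to one examiner or the other according to the sign of the product of the three codes. The coding and names are illustrative assumptions, not the layout actually used in the study cited.

```python
from itertools import product

def confound_with_abc():
    """Split the eight combinations of a 2 x 2 x 2 design into two blocks
    according to the sign of the three-factor interaction contrast."""
    blocks = {+1: [], -1: []}
    for a, b, c in product((-1, 1), repeat=3):
        blocks[a * b * c].append((a, b, c))
    return blocks

def contrast_sums(block):
    """Sums of the main-effect and two-factor contrasts within one block."""
    return {
        "A": sum(a for a, b, c in block),
        "B": sum(b for a, b, c in block),
        "C": sum(c for a, b, c in block),
        "AB": sum(a * b for a, b, c in block),
        "AC": sum(a * c for a, b, c in block),
        "BC": sum(b * c for a, b, c in block),
    }

blocks = confound_with_abc()
sums_by_block = {sign: contrast_sums(block) for sign, block in blocks.items()}
```

Every contrast other than the three-factor interaction sums to zero within each block, so the examiner difference leaves the main effects and two-factor interactions untouched and is inseparable only from the three-factor interaction.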


Multiple Classification with Disproportionate Frequencies 


The difficulties encountered in handling multiple-classification designs 
with disproportionate frequencies have become familiar to most investiga- 
tors in recent years. Collier (13) and Peggs (40) added to the available 
literature on this topic. Collier criticized the common procedure of dis- 
carding cases in order to equalize frequencies as involving the loss not only 
of information but also of degrees of freedom and power. 

Several empirical studies are of relevance. In a large-scale investigation 
of attitudes of parents and children toward educational practices Scar- 
borough (44) used a large number of double-classification designs in- 
volving region of the country and educational status of parents. To avoid 
the labor of dealing with disproportionate frequencies, an ingenious se- 
quential approximation was developed. Altho his conclusions may not have 
been affected, the procedure is based on the fallacious premise that a fac- 
torial design may be treated as if it were a single-classification design. 
Cook and Hoyt (14) applied variance design and Tsao’s procedure (49) 
for handling disproportionate subclass frequencies to the practical problem 
of determining the number and nature of norm groups for the Minnesota 
Teacher Attitude Inventory. Analysis of returns from a stratified sample 
of 1934 teachers, involving attention to variables such as training level of 
teachers, faculty size, and grade level, yielded information necessary to 
arrive at decisions regarding the setting up of norm groups. Anastasi and 
D’Angelo (3) also took cognizance of unequal subclass frequencies in utiliz- 
ing a 2 × 2 × 2 design based on sex, race, and neighborhood in a com- 
parison of 100 children in mean sentence length and Goodenough IQ. 


The Latin Square Principle in Design 


Designs utilizing principles of the Latin square were not frequently 
found in the literature surveyed; this may reflect a response to the com- 
monly repeated warning that interactions among variables should be negli- 
gible before this design is used. A single 3 × 3 Latin square involving three 
methods of presenting diagrams to children (filmslide, wall chart, black- 
board), three intelligence levels, and three schools was used in a study by 
Edwards and Parkin (19). Mech (35) utilized the Latin square principle 
in a “crossover” analysis of variance in which efficiency in addition was 
tested in the classroom under alternating conditions of noise and quiet. 
Unlike the practice in educational research, replicated Latin squares are 
commonly used in psychological research (29). This design, for example, 
was used by Maradie (34) who analyzed two replications of a 10 × 10 
square in a study of productivity on the Rorschach as a function of order 
of presentation of cards. Contributing to methodology, Edwards (17) pro- 
vided an article on balanced Latin square designs and Archer (5) an 
article on the use of Greco-Latin designs for learning studies. 
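
The defining property of a Latin square, that each treatment appears exactly once in every row and every column, can be sketched with a simple cyclic construction. The square below is unrandomized and is offered only as an illustration; in practice rows, columns, and treatments would be randomized. The function names are assumptions, and the treatment labels echo the three methods of presentation mentioned above.

```python
def cyclic_latin_square(treatments):
    """Build a k x k Latin square by shifting the treatment list one
    place to the left in each successive row."""
    k = len(treatments)
    return [[treatments[(r + c) % k] for c in range(k)] for r in range(k)]

def is_latin_square(square):
    """True if every symbol occurs exactly once in each row and column."""
    k = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == k and set(row) == symbols for row in square)
    cols_ok = all(
        {square[r][c] for r in range(k)} == symbols for c in range(k)
    )
    return len(symbols) == k and rows_ok and cols_ok

square = cyclic_latin_square(["filmslide", "wall chart", "blackboard"])
```

Rows might stand for intelligence levels and columns for schools, so that each method is tried once at every level and once in every school.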


Analysis of Covariance 


The use of analysis of covariance has been characteristic of recent educa- 
tional research. Gourlay (22) presented an excellent discussion of the use 
of covariance, describing in detail how it may be used to increase the pre- 
cision of experiments or to study relations between variates measured after 
treatment. His criticism of the use of nonrandom or “intact” groups is 
especially pertinent for educational research. Abelson (1) generalized the 
Neyman-Johnson technic to any number of predictor variables. 


Single-Classification Covariance Designs 


Covariance analysis is especially valuable for increasing precision in 
comparison of criterion scores when the criterion scores can be regressed 
on related initial scores. This frequently takes the form of adjusting final 
scores for initial differences in pre-experimental scores on the same variable. 
Boger (9) in a control-group study of the effects of perceptual training on 
intelligence test scores repeated IQ tests at the beginning and end of his 
study and evaluated differences in final status after adjusting for initial 
differences. Bouchard (10) compared arithmetic performance of a control 
group and three experimental groups differing in testing-training expe- 
rience, taking account of initial performance. Watts (51) in a study of the 
effects of mathematics training on intelligence test scores pretested and post- 
tested a mathematics group and a nonmathematics group and compared 
gains adjusted for pretest scores. In this study gains rather than the 
usual final scores were used as criterion measures. It is of interest to note 
that results are necessarily the same regardless of whether gains or final 
scores are used as the dependent variate. An interesting choice of a criterion 
variable was made in a study by Auble and Mech (6) who compared 
the effect of two conditions of verbal reinforcement on a simple arithmetic 
task. Scores were cumulated over a period of several experimental days and 
the two conditions of delay were evaluated on the terminal day by com- 
paring square root transformations of cumulated correct responses adjusted 
for number of correct responses on the initial day. This method yielded 
a comparison of effects for the entire experimental period rather than the 
usual comparison which is made with terminal scores. 
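
The remark above, that gains and final scores lead to the same result when the pretest is the covariate, can be checked numerically. The sketch below is only an illustration: the data are simulated, the group sizes and coefficients are arbitrary assumptions, and `adjusted_difference` is a hypothetical helper that uses the pooled within-group slope in the usual single-classification way.

```python
import random
from statistics import mean

def adjusted_difference(x0, y0, x1, y1):
    """Covariance-adjusted difference between two group means of y.

    x0, y0: covariate and criterion scores for the control group.
    x1, y1: covariate and criterion scores for the experimental group.
    Returns (adjusted difference, pooled within-group slope).
    """
    sxy = sxx = 0.0
    for xs, ys in ((x0, y0), (x1, y1)):
        mx, my = mean(xs), mean(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    diff = (mean(y1) - mean(y0)) - slope * (mean(x1) - mean(x0))
    return diff, slope

# Simulated (hypothetical) pretest and posttest scores.
random.seed(0)
pre_c = [random.gauss(100, 15) for _ in range(20)]
pre_e = [random.gauss(100, 15) for _ in range(20)]
post_c = [20 + 0.8 * x + random.gauss(0, 8) for x in pre_c]
post_e = [25 + 0.8 * x + random.gauss(0, 8) for x in pre_e]

# Gains are final scores minus pretest scores.
gain_c = [y - x for x, y in zip(pre_c, post_c)]
gain_e = [y - x for x, y in zip(pre_e, post_e)]

diff_final, slope_final = adjusted_difference(pre_c, post_c, pre_e, post_e)
diff_gain, slope_gain = adjusted_difference(pre_c, gain_c, pre_e, gain_e)

print("adjusted difference, final scores:", diff_final)
print("adjusted difference, gains:       ", diff_gain)
print("slopes (final, gain):", slope_final, slope_gain)
```

Using gains merely lowers the within-group slope by one; the adjusted difference is unchanged, which is why the choice of criterion makes no difference in this case.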

In a study of the relative contributions of general mathematics vs. 
algebra to mathematical competence Beckmann (7) compared gains on a 
specially constructed test using intelligence test scores obtained during the 
middle of the experimental year as a covariate. Generally speaking, infer- 
ence from covariance analysis is more complex when the supplementary 
variable may itself be affected by the experimental treatments (22). 

Bendig (8) compared ratings of instructors and courses by students in 
five college classes after adjusting for achievement test scores. Jones (25) 
compared means in verbal intelligence of English speaking and Welsh 
speaking children from the same locale, adjusting for reading ability. 
Keston (28) compared the scores on a final Music Preference Test of an 
experimental group exposed to both classical music and discussion, with 
the scores of a control group exposed to classical music only, adjusting for 
initial music preference and one of 10 other variables, e.g., musical pitch, 
musical training. Mouly (38) compared a remedial reading group with a 
control group on hour-point ratio taking into account initial differences in 
reading ability and ACE scores. 

Buswell (12) studied the performance of sociometrically accepted and 
rejected children on a number of variables, such as arithmetic and study 
skills, using covariance for intelligence test scores and socioeconomic status. 
This study is a particularly good example of the analysis of relationships 
by covariance technics. 

Alkire (2) compared mathematics achievement of high-school students 
preparing for college and students not preparing for college after con-
trolling for intelligence test scores. Burke and Anderson (11) compared
achievement test scores of 1939 and 1950 elementary-school pupils adjust- 
ing for intelligence test scores by covariance. Frederiksen and Schrader 
(20) compared the academic achievement of veterans and nonveterans in 
each of 25 colleges by means of a covariance technic developed by Gul- 
liksen and Wilks. Adjustment was made for scores on tests of scholastic 
aptitude and achievement as well as high-school grades. 


Multiple-Classification Covariance Designs 


Silverstein and others (45) employed covariance in a study of the 
effects of special training on the intelligibility of male and female talkers. 
Double classification was based on sex differences and an experimental- 
control dichotomy. Subclass frequencies were randomly equalized to avoid 
disproportionality and adjustment was made in terms of scores on an initial 
intelligibility test. In the literature surveyed this study was the only one 
that referred to the vexing problem of how best to make comparisons of 
individual means following a significant over-all F-test, a problem currently 
being given much consideration by statisticians. 

Dressel, Schmid, and Kincaid (16), using a 2 × 5 design, studied the
effect of writing frequency upon essay writing proficiency by selecting
extreme groups of freshmen who had spent the most and least time writing
essays. Final theme ratings were compared, adjusting for initial theme
ratings at the beginning of the year.

Using a 2 × 2 design, Perlman (41) compared the performance in
scientific problem solving of students in a historically oriented vs. a con- 
temporary oriented college physical science laboratory course. Covariance 
adjustments were made for ACE scores and scores on a scientific thinking 
pretest. 

Kruglak (30) compared individual instruction and a demonstration
method in the physics laboratory using a 2 × 4 design. Final criterion
performance was adjusted for ACE scores, mathematics background, and a
pretest on knowledge of physics. Since classes were unequal in size, cases
were randomly rejected to avoid disproportionality. This study is
particularly excellent in the careful attention given to experimental controls
over and above the statistical controls furnished by covariance.

Hoyt and others (24) employed a split-plot 2 × 4 × 3 × 3 factorial design
in a study where two methods of instruction in drawing were compared. 
Cross-classifications included methods, classes, types of drawing, and raters. 
Postexperimental ratings of drawing ability were adjusted in terms of pre- 
experimental ratings of drawing ability only after a multiple correlation 
analysis indicated that several other supplementary variables did not sig- 
nificantly increase predictive efficiency. The analysis of covariance involved 
the handling of disproportionate frequencies in the subclasses by Tsao’s 
method. 
The problem of disproportionate frequencies in double classification
covariance design was investigated by Edwards and Parkin (19) who found
that applying conventional methods to moderately disproportionate
frequencies did not produce much difference from an analysis carried out after the
frequencies had been made proportional by dropping cases randomly. The
frequencies did not depart significantly from proportionality, however, and
no generalization could be made as to when disproportionality makes a
difference.
CHAPTER VII


An Elementary Survey of Statistical Decision Theory 


MEYER A. GIRSHICK 


Statistical decision theory was originated some 15 years ago by Abraham
Wald. In the past decade, great strides have been made in this
field by Wald and others. However, so far these developments have had but 
little impact on experimental research in the social and physical sciences. 
There appear to be two basic reasons for this: one is the natural lag 
between theory and practice which so often occurs in science; the other, 
which in the present case may be more fundamental, is that decision 
theory to date has been too much concerned with the mathematical foun- 
dations of the subject and less with its immediate application. Curiously 
enough, here is a situation in which the foundational development, difficult 
as it is, is easier than the application to actual problems. 

What is decision theory? In its broadest sense, decision theory deals 
with the problem of decision making in the face of uncertainty. But since 
life is beset with all kinds of uncertainties, this definition is all-embracing 
and consequently nonilluminating. The truth of the matter is that the 
methods of decision theory are all-embracing, and could be said to en- 
compass the whole science of inductive reasoning. But being statisticians 
and not philosophers, we attempt to narrow down the field. This narrowing 
down consists mainly in a specification of the kind of uncertainty with 
which we are dealing and an insistence that the decisions to be made 
must be based on observations obtained from an experiment. 

The notions of decision theory may be introduced with an illustra- 
tion which, tho mathematically simple, exhibits all essential features. A 
few years ago Professor Merrill Flood, then at the RAND Corporation and 
presently at Columbia University, requested assistance in solving a prob- 
lem which he and his group encountered in their experimental work with 
learning models. For our purpose it is not essential to know the exact 
nature of the experiment except that it dealt with the question of how a 
person utilizes available information to make decisions. The problem, 
which is an abstraction of the real situation, can be presented in the 
form of the now famous “two-armed bandit” problem. In this formu- 
lation, we are given a slot machine with two arms, a right and left arm. 
The probability of paying off is different for each arm, one having a
probability $\omega$ of paying off, and the other a probability $\theta$, with $\omega$ greater
than $\theta$. When either arm pays off, the amount is $a$ monetary units. The
subject is told the values of $\omega$ and $\theta$, but he is not told which arm is
associated with which probability. The subject is allowed to pull either 
arm at any time for a total of N pulls. The problem is to study how, at
each stage, the subject utilizes the information thus far obtained to
decide which arm to pull next so as to maximize his total take.

As often happens in social science research, the formulation of an ex- 
perimental situation is immensely easier than the interpretation of the 
collected data. In this particular situation, the experimenters quickly 
realized that it would be impossible to analyze the behavior of the subject 
in any meaningful way since how to classify behavior in a situation of this 
sort or what constitutes an optimal mode of behavior for the subject 
was not known. It may be of interest to note that, after almost two years 
of battling with the “two-armed bandit” by a number of people, it still 
remains essentially unsolved, except for some special cases. One such 
special case I shall consider here in detail. It is the case in which the sum
of $\omega$ and $\theta$ is unity, but their values are not announced to the subject.
In addition, for the sake of somewhat greater generality, we assume 
that the payoff of the right arm is a and of the left arm is b, with a not 
necessarily equal to b. It is easily shown that this special case is equiv- 
alent to a simpler situation, namely, that of a coin-tossing experiment. 
The experiment can be reformulated in the following manner. The ex- 
perimenter selects a biased coin, the bias being subject to his choice. 
That is, he selects a coin that has a probability $\omega$ of falling “head” and
hence a probability $\theta = 1 - \omega$ of falling “tail,” where $\omega$ is any number
between zero and one, and is at the control of the experimenter.* The
same coin is tossed N times where N is some integer, say 100, and is 
part of the design of the experiment. Prior to each toss, the subject has 
to bet on its outcome, and as an inducement to the subject, we assume: if
he bets “head” and “head” appears, he receives a monetary units; if he
bets “head” and “tail” appears, he receives nothing. Similarly, if he
bets “tail” and “tail” appears, he receives b monetary units; but if he
bets “tail” and “head” appears, he again receives nothing. For the sake
of concreteness, we shall take a = 100, b = 150. The payoff to the sub- 
ject for each bet is summarized in Table 1. 


TABLE 1.—PAYOFF TO SUBJECT IN A SPECIAL CASE 
OF “TWO-ARMED BANDIT” PROBLEM 





                        Payoff for given bet
Outcome of toss         Head          Tail

Head                    a = 100       0
Tail                    0             b = 150


*How the experimenter can perform an experiment with a coin having any desired 
built-in bias without overtaxing the ingenuity of the U. S. Mint will not be discussed 
here. It suffices to say that even more complicated experiments are being performed 
on electronic machines by employing random numbers. How to utilize these numbers 
most economically is itself an interesting decision problem which has been solved. 
We shall now exhibit the essential elements of this experiment. As
we shall see later, these elements are basic to every statistical decision
problem: (a) After the first toss, the subject can base his decisions on
information supplied by the outcomes of the previous tosses. (b) The
information will come to the subject thru a random mechanism, namely,
the tossing of a coin. (c) The outcome of the toss depends in a probability
sense on the bias in the coin. (d) The extent of the bias, here the value
of $\omega$, is completely at the control of the experimenter and is unknown
to the subject. (e) The payoff provides the subject’s inducement to do 
well, and he will presumably wish to maximize the total income for the 
N bets. 

The experimenter would clearly be interested in the decision procedure 
employed by the subject. Note that what we are after is not what bets 
the subject makes, because this can be easily noted, but rather the deci- 
sion procedure which he employs. More technically, what we are after 
is the strategy of the subject. From the outset, it is imperative that we un- 
derstand the notion of a strategy. When the layman says that he has a 
strategy, he usually implies that he has what he thinks is a good way 
of behaving in a given situation. This is not what is meant by the word 
strategy in the technical sense. The word is to be taken in an absolutely 
neutral sense. It carries with it no value judgment. A strategy simply 
is a prescribed mode of behavior, be it good, bad, or indifferent. It 
is a complete description of what a person is going to do under every 
possible eventuality with which he might be confronted in the situation 
under consideration. Improvising on the spur of the moment does not 
constitute a strategy. This type of behavior cannot be evaluated or its 
consequences predicted. On the other hand, a strategy considered as a 
rule of behavior can be evaluated since it prescribes not only what one 
is going to do at this instant, but also what one’s behavior will be at all 
times whenever he is confronted with situations of this sort, even tho 
the eventualities in the situations may differ. 

All too frequently research workers have confused the decision which 
they make after an experiment has been performed, with a strategy for deci- 
sion making in experiments of the type contemplated. A decision based 
on the information supplied by an experiment must be an outgrowth 
of a rule previously determined, and it is the consequences of the rule 
and not the particular decision that lends itself to statistical evaluation. 

An objection usually raised is that in many situations it is humanly 
impossible to formulate a strategy in all its detail. For example, it would 
probably take endless volumes to describe a reasonable strategy for the 
game of chess. Altho this is true, it should be pointed out that (a) the 
notion of a strategy is still tremendously fruitful and often makes it 
possible to prove theorems about games like chess even tho we cannot 
write down the instructions which would constitute a strategy, and (b) 
in many situations in statistics it has been found possible to characterize 
not only a single strategy but whole classes of strategies which have 
optimal properties. The concept of a strategy will become clearer as we
progress with this discussion. 

Let us return to the coin-tossing experiment. In order to understand 
what is meant for the subject to have a strategy in this experiment, we 
must visualize the possibility of the subject not being present at the experi- 
ment. Instead he sends a proxy who cannot make any decisions of his 
own volition and must follow the instructions of the subject to the letter. 
In order for the proxy to function in this experiment, the subject has to 
specify in detail how he is to behave in every possible situation, that is, 
what he is to do initially before the first toss of the coin, and thereafter 
how he is to bet for every possible outcome of the previous tosses. 

As an illustration, consider the case in which the experimenter will toss 
the coin only three times. A strategy for the subject must specify (always 
think of the instructions he gives to the proxy) what bet he will make on 
the first toss when he has no information, and what he will do on the 
second toss after he knows the outcome of the first toss. More specifically, 
if H represents head and T tail, the possible strategies available to the 
statistician are summarized in Table 2. Strategy 1 instructs the proxy 
to bet “head” on the second toss no matter what he sees on the first toss. 


TABLE 2.—POSSIBLE BETTING STRATEGIES ON 
SECOND TOSS OF A COIN 





                         Strategy on second toss
Outcome of first toss      1    2    3    4

H                          H    H    T    T
T                          H    T    H    T



Strategy 2 instructs the proxy to bet on the second toss whatever he sees 
on the first toss. Strategy 3 tells the proxy to reverse the situation. Strategy 
4 tells him to bet “tail” no matter what he sees. 

After two tosses, the strategies available for the third toss are 16 in
number and are summarized in Table 3. Consider, for example, strategy
2. This strategy says: bet “head” unless the results of both tosses are
tails. In strategy 7, to give a second example, the instruction is to ignore
the first toss and to make the bet agree with the outcome of the second toss.


TABLE 3.—POSSIBLE BETTING STRATEGIES ON
THIRD TOSS OF A COIN

Outcome of                        Strategy on third toss
first two tosses     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

H H                  H  H  H  H  T  H  H  H  T  T  T  H  T  T  T  T
H T                  H  H  H  T  H  H  T  T  H  H  T  T  H  T  T  T
T H                  H  H  T  H  H  T  H  T  H  T  H  T  T  H  T  T
T T                  H  T  H  H  H  T  T  H  T  H  H  T  T  T  H  T

In general, the number of strategies available to the subject after $N$
tosses is $2^{2^N}$, which is a huge number for any sizable $N$. Moreover, a grand
strategy for all $N$ tosses would consist of a selection of an initial strategy and
then one from each of a sequence of tables such as Tables 2 and 3. Even 
this huge number does not exhaust all the possibilities available to the 
subject. For example, suppose he contemplates the following strategy for 
betting on the third toss: He bets “head” if both previous tosses were 
heads; he bets “tail” if both previous tosses were tails, but if either HT 
or TH appears, he is hesitant to make a positive bet. To express this 
hesitancy in the form of a strategy, he might decide to carry along with 
him his own coin with bias subject to his own control (probably in 
favor of tail because of the higher payoff) which he will toss in this 
case and bet according to the outcome. He can clearly also accomplish 
the same objective by the following device: Prior to the experiment, he 
can toss the coin to decide whether in Table 3 he will employ strategy 
2 or 12. Space does not permit going into any detailed discussion of these 
types of strategies which are known as randomized or mixed strategies, but 
it suffices to say they form an integral part of game theory and statistical 
decision theory, and we shall encounter them again in what follows. 
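
The counting argument above can be made concrete with a short enumeration. This is a sketch only: the function name `pure_strategies` and the representation of a strategy as a mapping from observed histories to bets are illustrative assumptions, and the order in which strategies are generated is not the numbering used in Tables 2 and 3.

```python
from itertools import product

def pure_strategies(n):
    """All nonrandomized rules for the next bet after n observed tosses.

    Each strategy maps every possible history of n tosses (for example
    "HT") to a bet, "H" or "T".
    """
    histories = ["".join(h) for h in product("HT", repeat=n)]
    return [dict(zip(histories, bets))
            for bets in product("HT", repeat=len(histories))]

# The number of rules after n tosses is 2 ** (2 ** n).
for n in (1, 2, 3):
    print(n, len(pure_strategies(n)), 2 ** (2 ** n))

# The two rules mixed in the randomized example of the text:
# "bet head unless both tosses are tails" and "bet head only after two heads".
head_unless_tt = {"HH": "H", "HT": "H", "TH": "H", "TT": "T"}
head_only_after_hh = {"HH": "H", "HT": "T", "TH": "T", "TT": "T"}
two_toss_rules = pure_strategies(2)
print(head_unless_tt in two_toss_rules, head_only_after_hh in two_toss_rules)
```

The two rules agree after HH and after TT and differ only after HT or TH, so tossing a coin to choose between them randomizes the bet exactly in the doubtful cases.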

At first glance, the immense number of strategies available to the subject 
might appear highly discouraging. However, as we shall see later, most of 
these strategies are not worthy of consideration in rational behavior, 
and a major task of statistical decision theory is to discover methods for 
separating the wheat from the chaff. Before the winnowing can be under- 
taken, however, we must have a clear definition in mind of what is 
“wheat” and what is “chaff,” or in more technical language, what con- 
stitutes an “admissible” and what constitutes an “inadmissible” strategy. 

Because a strategy is a rule of behavior and not a particular choice 
of an action at a given moment when such an action is called for, its 
evaluation is based not on the payoff at any specified time but rather 
on the expected payoff in the long run. From this point of view we can 
now examine several of the strategies enumerated above, and eventually 
formulate the notions of admissible and inadmissible classes of strategies, 
as well as the concept of “minimax” strategies which one encounters in 
game theory and decision theory. 

Let us begin with the examination of possible strategies for the first 
bet of the subject when he has no information about the behavior of the 
coin. The class of nonrandomized strategies available to him consists of the 
following two: $S_1$, always bet “head”; $S_2$, always bet “tail.” Assume
now that the experimenter’s coin has probability $\omega$ of falling head. If
we represent by $I(S_1 \mid \omega)$ the expected income from strategy $S_1$ and by
$I(S_2 \mid \omega)$ the expected income from strategy $S_2$, then since the subject
receives 100 monetary units if he bets “head” and head appears, and 150
monetary units if he bets “tail” and tail appears, but receives nothing
otherwise, we have:


$$I(S_1 \mid \omega) = 100 \times \omega + 0 \times (1 - \omega) = 100\omega$$
$$I(S_2 \mid \omega) = 0 \times \omega + 150 \times (1 - \omega) = 150(1 - \omega)$$


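If we plot $I(S_1 \mid \omega)$ and $I(S_2 \mid \omega)$ as functions of $\omega$, we get two straight
lines intersecting at $\omega = .6$, as can be seen in Figure I. That is, when
$I(S_1 \mid \omega) = I(S_2 \mid \omega)$, we have $100\omega = 150(1 - \omega)$ or

$$\omega = \frac{150}{250} = .6$$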

A randomized strategy $S_r$ consists in the subject’s selecting a coin with
a known probability $r$ of falling head, and always betting according to the
outcome of a toss of this coin. Let $I(S_r \mid \omega)$ denote the expected income if
strategy $S_r$ is used. Then, since on any toss the subject’s coin has probability
$r$ of falling head and his expected income in that case is $I(S_1 \mid \omega)$, while
his coin has probability $(1 - r)$ of falling tail and his expected income is
then $I(S_2 \mid \omega)$, it follows that:


$$I(S_r \mid \omega) = r\,I(S_1 \mid \omega) + (1 - r)\,I(S_2 \mid \omega) = 250r\omega - 150r - 150\omega + 150.$$





Clearly, for any $r$, $I(S_r \mid \omega)$ is a straight line which passes thru the
intersection point of $I(S_1 \mid \omega)$ and $I(S_2 \mid \omega)$. This is true since when $\omega = .6$,

$$I(S_1 \mid \omega) = I(S_2 \mid \omega) = 60$$

and $I(S_r \mid \omega) = 60$ for all $r$. An example of such a line with $r = .6$ is given in
Figure I.
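
The expected incomes for a bet made without observations can be tabulated directly. The following is a minimal sketch under the payoffs assumed in the text (100 for a correct “head” bet, 150 for a correct “tail” bet); the function and constant names are illustrative and do not come from the original.

```python
# Payoffs from the text: 100 for a correct "head" bet, 150 for a correct "tail" bet.
A_HEAD, B_TAIL = 100, 150

def income_head(w):
    """Expected income of always betting head when P(head) = w."""
    return A_HEAD * w

def income_tail(w):
    """Expected income of always betting tail when P(head) = w."""
    return B_TAIL * (1 - w)

def income_mixed(r, w):
    """Expected income of betting head with probability r, tail otherwise."""
    return r * income_head(w) + (1 - r) * income_tail(w)

# The two pure-strategy lines cross where 100 w = 150 (1 - w).
crossing = B_TAIL / (A_HEAD + B_TAIL)
print("lines cross at w =", crossing)

# With r = .6 the mixed strategy yields the same income for every w.
for w in (0.0, 0.25, 0.6, 1.0):
    print(w, income_head(w), income_tail(w), income_mixed(0.6, w))
```

Every mixed line passes thru the crossing point, and the line for r = .6 is horizontal at 60 units.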

For values of $\omega$ between zero and .6, the subject will do best by playing
strategy $S_2$ rather than any other strategy, pure or randomized, so that the
maximum attainable expected income when $\omega$ is less than .6 is given by
$I(S_2 \mid \omega)$. For values of $\omega$ greater than .6, the subject would do better by
employing strategy $S_1$ than by employing any other strategy available to
him. Thus, the maximum attainable expected income when $\omega$ is greater
than .6 is given by $I(S_1 \mid \omega)$. The curve of the maximum attainable income is
then given in Figure I by the two heavy line segments joining at $\omega = .6$.
For any value of $\omega$, the ordinate of this curve represents the maximum
amount the subject can get, on the average, if he knows the experimenter’s
choice of $\omega$. We call this curve the maximum attainable income curve on a
single toss. (It will be remembered that the income curve for any strategy 
represents the expected payoff in the long run, that is, the average income 
the subject expects to receive if the strategy were used consistently in a long 
series of repetitions of the experiment.) 

Suppose the experimenter were to behave in the following manner: He
wishes to select a value of $\omega$ such that the most that the subject can attain,
on the average, even if the value of $\omega$ were known to him, is as small as
possible. Clearly then, the experimenter would choose the value $\omega = .6$. In
his turn, the subject might examine possible expected incomes that he can
get for various values of $r$ at his disposal and come to the conclusion that
with one exception, namely, $r = .6$, he runs the risk of getting less than 60
monetary units, on the average, for some choices of $\omega$ by the experimenter.

[FIGURE I. Expected Income Curves for Specified Strategies with No Observations]


However, if he chooses $r = .6$, then he is guaranteed that his expected
income will be 60 monetary units no matter what the experimenter chooses
for the value of $\omega$. The value $\omega = .6$ for the experimenter, and the value
$r = .6$ for the subject are known in the language of game theory as minimax
strategies. Roughly speaking, minimax strategies in a game have the
property that to one of the players, the strategy guarantees that his expected
income will be at least a certain amount regardless of the second player’s
strategy and to the second player it guarantees that his expected loss will
be at most a certain amount whatever be the strategy of his opponent. Thus,
in a certain sense, minimax strategies are safe strategies, and one would be 
tempted to use them in a situation where the opponent is assumed to be 
rational and is motivated by a desire to obtain the maximum from the 
contest. 

No method can be given here for determining minimax strategies in the 
general case. However, there is available a simple criterion for testing 
whether a given strategy is minimax. The criterion is this: A minimax 
strategy for the first player, say, is such that (a) it guarantees him an ex- 
pected income of at least a certain amount no matter what the second 
player does; and (b) the second player can be shown to possess a strategy 
which will make it possible, on the average, for him to hold the first 
player to this amount. 
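
The two-part criterion can be checked numerically for the bet made without observations. The sketch below is an illustration under the payoffs of the text; the grid search and the function names are assumptions introduced here and are not a general method for finding minimax strategies.

```python
A_HEAD, B_TAIL = 100, 150  # payoffs for correct "head" and "tail" bets

def guaranteed_income(r):
    """Worst-case expected income of betting head with probability r.

    The expected income is linear in the coin's bias, so its minimum over
    all biases is attained at one of the extremes, bias 0 or bias 1.
    """
    return min(B_TAIL * (1 - r), A_HEAD * r)

def best_reply_income(w):
    """Most the subject could expect if he knew the bias w."""
    return max(A_HEAD * w, B_TAIL * (1 - w))

grid = [k / 1000 for k in range(1001)]

# (a) The subject's safest mixing probability and the income it guarantees.
best_r = max(grid, key=guaranteed_income)
print("safest r:", best_r, "guarantee:", guaranteed_income(best_r))

# (b) The experimenter's bias that holds the subject to the same amount.
worst_w = min(grid, key=best_reply_income)
print("least favorable bias:", worst_w, "ceiling:", best_reply_income(worst_w))
```

Because the guarantee in (a) and the ceiling in (b) coincide at 60 units, both conditions of the criterion are met.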

How does the situation change as the experiment progresses and the
subject is confronted with observations on previous tosses of the coin? We
have seen that the number of strategies available to him becomes enormously
multiplied and supplies a larger class from which to choose; consequently,
he has a better chance of finding one which is in some sense optimal. Again,
without observations, there exists no strategy that will guarantee the
attainment, on the average, of the values of the maximum income curve for all
values of $\omega$. However, with a proper choice of a strategy based on
observations, this objective comes closer to being realized, and in fact, as the
number of observations increases, the expected income curves based on properly
chosen strategies are no longer straight lines (see Figure III) and
approach the maximum attainable income curve. What constitutes properly
chosen or admissible strategies will be taken up next.

Let us first consider the income curves of each of the four strategies based
on a single observation, given in Table 2. The income curve for strategy 1
which we designate by $I(1 \mid \omega)$ is clearly the same as $I(S_1 \mid \omega)$ and that of
strategy 4 which we designate by $I(4 \mid \omega)$ is that of $I(S_2 \mid \omega)$. However, for
strategies 2 and 3, the income curves $I(2 \mid \omega)$ and $I(3 \mid \omega)$ are parabolas. The
income curve $I(2 \mid \omega)$ is computed as follows: The only time when the
subject receives nonzero income is when he bets on head and head appears,
or he bets on tail and tail appears. In the former case, he gets 100 units, in 
the latter case 150. Strategy 2 tells him to bet on head if head appears in 
the first toss. The probability that he wins on head is the product of the 
probabilities that the first toss is a head (and hence he will bet on head) 
and the second toss is a head (and hence he will win). Similarly, the prob- 
ability that he wins on tail is the product of the probabilities that the first 
toss is a tail and the second toss a tail. Consequently, the probability that
he wins 100 units is $\omega^2$ and the probability that he wins 150 units is
$(1 - \omega)^2$. It follows that:

$$I(2 \mid \omega) = 100\omega^2 + 150(1 - \omega)^2 = 250\omega^2 - 300\omega + 150.$$

A similar argument shows that the income from strategy 3 is given by:

$$I(3 \mid \omega) = 250\omega(1 - \omega).$$

[FIGURE II. Expected Income Curves for Specified Strategies with One Observation]


The four income curves are plotted in Figure II. In addition to these four
curves, Figure II also has the curve for the minimax strategy in which the
first observation is ignored, namely, $I(S_{.6} \mid \omega)$, as well as the expected
income $I(S \mid \omega)$ from a randomized strategy $S$ which selects strategies 2 and 4
(see Table 2) with probabilities 2/3 and 1/3, respectively. This curve is
computed from:


$$I(S \mid \omega) = \tfrac{2}{3}\,I(2 \mid \omega) + \tfrac{1}{3}\,I(4 \mid \omega).$$
A study of Figure II shows the following: 


1. The maximum attainable income curve is given by curve A for values
of $\omega$ greater than .6, and by curve D for values of $\omega$ less than .6.

2. At $\omega = .6$, all the curves yield an expected income of 60 monetary
units. This will also be true for any probability mixtures of these curves,
i.e., randomized strategies. Thus, if the experimenter selects $\omega = .6$, the
subject cannot in the long run get any more than 60 units no matter what
strategy he employs. For any other value of $\omega$, there is a strategy for the
subject which will yield him an expected income of more than 60 units.
For example, for $\omega = .8$ either strategy A or B will accomplish this, or
for $\omega = .2$, one of the three strategies B, F, or D will accomplish this. The
value $\omega = .6$ may, therefore, be considered the least favorable value of $\omega$
from the point of view of the subject.

3. There is a strategy, namely, strategy 2, whose income curve B begins
to approach the maximum attainable income curve and, indeed, has three
points in common with it, namely, the expected incomes for $\omega = 0$, .6, and
1, respectively.

4. Strategy 2 guarantees the subject for all values of $\omega$ an income which,
in the long run, is never less than 60 monetary units. Since, if the
experimenter chooses $\omega = .6$, there is no strategy which will guarantee the subject
an expected income of more than 60 units, it follows from the criterion
given above that strategy 2 is a minimax strategy.

5. Strategy $S_{.6}$, yielding the constant income curve E of 60 monetary units
is also a minimax strategy. This strategy is one in which the subject ignores
the result of the first observation, and tosses his own coin with $r = .6$ to
determine his bet on the second toss of the experimenter’s coin. However,
strategy 2 which makes use of the first observation and takes advantage
of any departure on the part of the experimenter from his minimax strategy
$\omega = .6$, is manifestly superior to strategy $S_{.6}$. For, if we examine the
corresponding income curves B and E, we see that while they coincide for $\omega = .6$,
curve B is higher than curve E for all other values of $\omega$. Thus, while both
strategies are “safe” strategies, strategy 2 which utilizes the observation
is as good, in terms of expected income, as strategy $S_{.6}$ for the least
favorable value of $\omega$, and better everywhere else. From the point of view of
rational behavior, it is clear that the constant income minimax strategy $S_{.6}$
should be discarded in favor of strategy 2, that is, the constant income
minimax strategy can be classed as inadmissible.

6. Consider now strategy 3 yielding the income curve C. Here again we 
see that this curve is never above curve F so that it too should be classed 
as inadmissible. 
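
The comparisons read from Figure II can be verified with exact arithmetic. The sketch below is illustrative only: the curve letters follow the text (A to D for strategies 1 to 4, E for the constant 60, F for the mixture of strategies 2 and 4), the function names are hypothetical, and the mixing weight for F is treated as an assumption and set to 2/3, the value for which curve F touches curve C at a single point.

```python
from fractions import Fraction

def curve_a(w):
    """Strategy 1: always bet head."""
    return 100 * w

def curve_b(w):
    """Strategy 2: bet whatever appeared on the first toss."""
    return 100 * w**2 + 150 * (1 - w)**2

def curve_c(w):
    """Strategy 3: bet the opposite of the first toss."""
    return 250 * w * (1 - w)

def curve_d(w):
    """Strategy 4: always bet tail."""
    return 150 * (1 - w)

def curve_e(w):
    """Constant-income minimax strategy that ignores the observation."""
    return Fraction(60)

def curve_f(w, p=Fraction(2, 3)):
    """Mixture: strategy 2 with probability p, strategy 4 otherwise.

    The default p = 2/3 is an assumption made for this sketch.
    """
    return p * curve_b(w) + (1 - p) * curve_d(w)

def dominates(better, worse, grid):
    """True if `better` is never below `worse` and is above it somewhere."""
    return (all(better(w) >= worse(w) for w in grid)
            and any(better(w) > worse(w) for w in grid))

grid = [Fraction(k, 100) for k in range(101)]  # exact values 0, .01, ..., 1

print("B dominates E:", dominates(curve_b, curve_e, grid))
print("F dominates C:", dominates(curve_f, curve_c, grid))
print("all curves at .6:",
      [curve(Fraction(3, 5)) for curve in
       (curve_a, curve_b, curve_c, curve_d, curve_e, curve_f)])
```

A grid check of this kind can only expose a dominated strategy; it cannot by itself establish that a strategy is admissible.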

We have now exhibited two strategies which are inadmissible: one which is
inadmissible among all general strategies, and one which is inadmissible
among the minimax strategies. What constitutes an inadmissible strategy
should now be clear from the context. A strategy S is inadmissible if there
exists another strategy S* yielding an expected income which is never
smaller than that of S for any value of ω, and is greater for some values
of ω. By contrast, then, a strategy S is admissible if no S* satisfying
the above conditions exists.
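
The definition translates directly into a check over a grid of values of ω. The following is a minimal sketch, not a proof procedure: a finite grid can only suggest dominance, and the two income functions used for illustration assume payoffs of 100 and 150 units and a first strategy that bets on whatever the observed toss showed.

```python
def dominates(income_a, income_b, grid, tol=1e-9):
    """True if strategy A is never worse than B on the grid and better somewhere,
    which is the condition that makes B inadmissible."""
    never_below = all(income_a(w) >= income_b(w) - tol for w in grid)
    somewhere_above = any(income_a(w) > income_b(w) + tol for w in grid)
    return never_below and somewhere_above


# Illustrative income curves (assumed payoffs: 100 for head, 150 for tail).
def follow_first_toss(w):
    return 100.0 * w * w + 150.0 * (1.0 - w) * (1.0 - w)


def constant_income(w):
    return 60.0


grid = [i / 100 for i in range(101)]
print(dominates(follow_first_toss, constant_income, grid))
print(dominates(constant_income, follow_first_toss, grid))
```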

A little reflection will show that it is usually easier to demonstrate that a 
strategy is inadmissible than that it is admissible, for in the former case 
it is only necessary to produce an example of a better strategy, while in the 
latter case, one apparently is required to examine the whole class of strat- 
egies to be sure that no better one exists. But fortunately mathematical 
ingenuity comes into play here, and makes it possible often to characterize 
quite simply the whole class of admissible strategies. Thus, for example,
in the case of one observation, all admissible strategies are of the form, “If 
head appeared in the first toss, bet on head, but if tail appeared in the first 
toss, then toss a coin with probability r for head, and bet according to its 
outcome”; or alternatively, “If head appeared in the first toss, toss a coin 
with probability r for head and bet according to its outcome, but if tail ap- 
peared, bet on tail.” 

To get a better conception of the way in which decision theory separates
the wheat from the chaff, let us consider the situation confronting the
subject prior to the (n+1)st toss of the coin, ignoring for the moment the
problem of dovetailing the strategies from toss to toss. As we have pointed
out, even if we discount the possibility of randomized strategies, the
number of nonrandomized strategies available to him is 2^(2^n). However,
decision theory tells us that we never need to consider all of the 2^(2^n)
strategies since we can ignore the order in the sequences of heads and tails
in n tosses, and base our decision entirely on the number of heads in each
of these sequences without affecting the realizable income. This immediately
cuts down the number of nonrandomized strategies to 2^(n+1). Next, decision
theory shows that of the totality of all possible strategies, randomized as
well as nonrandomized, only those strategies are admissible which are of the
form, “If the number of heads in the sequence of n tosses is less than c,
bet on tail; if the number of heads in the n tosses is greater than c, bet
on head; if the number of heads is equal to c, then toss a coin with
probability r for head, and bet according to its outcome.” Here c is an
integer between zero and n, inclusive, and 0 ≤ r ≤ 1. Moreover, it can also
be shown that given any strategy which is not of this form, we can find one
of this form which never is below it in terms of the resulting income curve.
That is, in technical language, this class is complete.
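
The counting argument and the form of the admissible rules can be sketched as follows. The function names are hypothetical, and the random number generator merely stands in for the subject's own coin.

```python
import random


def count_rules_on_sequences(n):
    # One bet (head or tail) for each of the 2**n ordered sequences.
    return 2 ** (2 ** n)


def count_rules_on_head_count(n):
    # One bet for each possible number of heads, 0 through n.
    return 2 ** (n + 1)


def bet(heads, c, r, rng=random):
    """Rule of the form described above: tail below c, head above c,
    and at exactly c heads a bet on head with probability r."""
    if heads < c:
        return "tail"
    if heads > c:
        return "head"
    return "head" if rng.random() < r else "tail"


print(count_rules_on_sequences(5), count_rules_on_head_count(5))
print(bet(2, 3, 0.76), bet(4, 3, 0.76))
```

For n = 5 the reduction is from 4,294,967,296 rules to 64 before randomization is even considered.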

We see then that after the subject has observed the outcome of n tosses
of the coin, a strategy for him, if he behaves in a reasonable way, consists
in a choice of an integer c and a fraction r. We designate such a strategy
by S_{c,r}. For a given ω, the expected income from the (n+1)st toss, based
on such a strategy, is given by:

\[
I(S_{c,r} \mid \omega)
  = 100\,\omega \sum_{j=c+1}^{n} \binom{n}{j}\,\omega^{j}(1-\omega)^{n-j}
  + 150\,(1-\omega) \sum_{j=0}^{c-1} \binom{n}{j}\,\omega^{j}(1-\omega)^{n-j}
  + \bigl(100\,\omega r + 150\,(1-\omega)(1-r)\bigr)\binom{n}{c}\,\omega^{c}(1-\omega)^{n-c}
\]

FIGURE III. Expected Income Curves for Specified Strategies with Five Observations
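
The expected-income formula can be evaluated term by term. The sketch below follows the three terms of the formula (more than c heads, fewer than c heads, exactly c heads); the function names are hypothetical.

```python
from math import comb


def binom_pmf(j, n, w):
    # Probability of exactly j heads in n tosses when P(head) = w.
    return comb(n, j) * w**j * (1.0 - w) ** (n - j)


def expected_income(c, r, n, w):
    """Expected income from the (n+1)st toss under the strategy with integer c
    and fraction r, for a coin with P(head) = w."""
    above = sum(binom_pmf(j, n, w) for j in range(c + 1, n + 1))  # bet on head
    below = sum(binom_pmf(j, n, w) for j in range(0, c))          # bet on tail
    at_c = binom_pmf(c, n, w)                                     # randomize
    return (100.0 * w * above
            + 150.0 * (1.0 - w) * below
            + (100.0 * w * r + 150.0 * (1.0 - w) * (1.0 - r)) * at_c)


for w in (0.0, 0.3, 0.6, 0.9, 1.0):
    print(w, round(expected_income(3, 0.76, 5, w), 2))
```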

It can be shown that the minimax strategy (again ignoring the problem of
dovetailing) is given by those values of c and r such that when ω = .6,
the probability prior to the experiment that the subject will bet on head
is also .6. Moreover, it can also be shown that this strategy is the only
admissible minimax strategy with n observations.
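
The condition just stated determines c and r by simple arithmetic. The following sketch, with a hypothetical function name, walks c downward from n until the probability of betting on head at ω = .6 can be brought to exactly .6 by a suitable r.

```python
from math import comb


def minimax_c_r(n, w_star=0.6):
    """Find c and r such that, at P(head) = w_star, the probability of
    betting on head equals w_star."""
    pmf = [comb(n, j) * w_star**j * (1.0 - w_star) ** (n - j) for j in range(n + 1)]
    tail = 0.0  # probability of more than c heads
    for c in range(n, -1, -1):
        # P(bet on head) = P(heads > c) + r * P(heads = c)
        if tail + pmf[c] >= w_star:
            return c, (w_star - tail) / pmf[c]
        tail += pmf[c]


c, r = minimax_c_r(5)
print(c, round(r, 2))
```

For n = 5 this reproduces the values c = 3 and r = .76 used in the discussion of Figure III.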

Figure III gives the income curves for the following three strategies for
n = 5: S_{1,0}; S_{3,.76}, which is minimax; and S_{3,0}. A few remarks
about the curves in this figure are in order.

1. The minimax strategy which yielded the income curve I(2|ω) in
Figure II remains a possible minimax strategy for this situation as well.
It is a strategy which requires ignoring four out of the five observations
and betting on the outcome of the fifth. But this minimax strategy is
uniformly worse than the strategy S_{3,.76}, whose income curve is plotted in
Figure III, except for the points ω = 0, ω = 1, and ω = .6, at all of which
the two curves agree.
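
This remark can be checked numerically. The sketch below assumes that the strategy behind I(2|ω) has expected income 100ω² + 150(1−ω)², which is what betting on the outcome of a single observed toss yields under the payoffs of the formula above; the function names are hypothetical.

```python
from math import comb


def binom_pmf(j, n, w):
    return comb(n, j) * w**j * (1.0 - w) ** (n - j)


def expected_income(c, r, n, w):
    above = sum(binom_pmf(j, n, w) for j in range(c + 1, n + 1))
    below = sum(binom_pmf(j, n, w) for j in range(0, c))
    at_c = binom_pmf(c, n, w)
    return (100.0 * w * above + 150.0 * (1.0 - w) * below
            + (100.0 * w * r + 150.0 * (1.0 - w) * (1.0 - r)) * at_c)


def income_single_toss(w):
    # Assumed form of I(2|w): bet on whatever one observed toss showed.
    return 100.0 * w * w + 150.0 * (1.0 - w) * (1.0 - w)


# Exact r for n = 5, c = 3: P(bet on head) must equal .6 when w = .6.
R_MINIMAX = (0.6 - binom_pmf(4, 5, 0.6) - binom_pmf(5, 5, 0.6)) / binom_pmf(3, 5, 0.6)


def advantage(w):
    # Income of the five-observation minimax strategy minus income of the single-toss one.
    return expected_income(3, R_MINIMAX, 5, w) - income_single_toss(w)


for w in (0.0, 0.3, 0.6, 0.8, 1.0):
    print(w, round(advantage(w), 3))
```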

2. It is important to emphasize here once more that if the experimenter
wishes to hold the subject to a maximum expected payoff of 60 units, he
will never depart from ω = .6. On the other hand, even if the subject had
good reason to believe that this is the intention of the experimenter, it is
still true that a rational way of behaving for him is to play the minimax
strategy c = 3, r = .76. This argument might appear more reasonable if we
changed the setting from the rarefied atmosphere of pure research in
learning models, to a gambling house where the subject has to play red or
black in roulette. Here the experimenter, now the house, will definitely
try to construct a roulette wheel in such a way as to hold the maximum
expected payoff to a minimum, but an optimal strategy for the subject,
now the gambler, after observing the outcome of n plays, is a betting
strategy which guarantees him at least that minimum, but at the same
time takes advantage of any possible bias in the machine that is due to
mechanical or other reasons.

3. If we examine the income curves of the remaining two strategies in
Figure III, we see that curve A, the expected income from the strategy
c = 1, r = 0, gets closer to the maximum attainable income curve for ω
greater than .6, but this is accomplished at the expense of a relatively
large departure from the maximum attainable income curve for ω less than .6.
Curve B for the strategy c = 3, r = 0, on the other hand, reverses this
situation. This is a general property of admissible strategies. They can be
chosen so as to yield a substantial improvement for some values of ω at the
expense of other values. This suggests that a choice of a particular
strategy from a class of admissible strategies would be facilitated if some
prior information as to the possible choices of the value of ω by the
experimenter were available. The type of information most commonly
considered in decision theory is that of a known a priori probability
distribution on the possible values of ω with which the experimenter selects
a particular ω for his coin. An admissible strategy in this case is one
which now maximizes the expected income with respect to this distribution.
Such a strategy is known as a Bayes strategy. Bayes strategies play an
important role in decision theory because they often are admissible and
their totality generally forms a complete class of strategies.
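
A Bayes strategy can be computed directly once a prior distribution is written down. The sketch below is illustrative only: the discrete priors are invented for the example, the payoffs of 100 and 150 units are those of the expected-income formula, and the function name is hypothetical. For each possible number of heads it picks the bet with the larger expected payoff under the prior.

```python
from math import comb


def bayes_rule(prior, n):
    """prior: dict mapping a value of omega to its prior probability
    (a hypothetical discrete prior). Returns a list whose j-th entry is the
    bet made after observing j heads in n tosses."""
    rule = []
    for j in range(n + 1):
        # Joint weight of each omega with the observation "j heads".
        weight = {w: p * comb(n, j) * w**j * (1.0 - w) ** (n - j)
                  for w, p in prior.items()}
        gain_head = sum(100.0 * w * q for w, q in weight.items())
        gain_tail = sum(150.0 * (1.0 - w) * q for w, q in weight.items())
        rule.append("head" if gain_head > gain_tail else "tail")
    return rule


uniform_prior = {i / 20: 1.0 / 19 for i in range(1, 20)}  # assumed for illustration
print(bayes_rule(uniform_prior, 5))
```

The resulting rule bets on tail for small head counts and on head for large ones, which is the form of the admissible strategies described above.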

By this time, the reader may begin to be puzzled by the twist which has
occurred during the course of the exposition. As a scientist interested in
learning, he undoubtedly expected that some statistical wisdom would be
unfolded as to how the learning experiment should be designed, conducted,
and analyzed. Instead, he finds that thus far the major portion of the
discussion concerned itself with the rationality of the behavior of the
subject. However, this twist, altho not altogether intentional, is highly
fortunate from a conceptual point of view. In the general formulation of
decision theory, nature plays the role of the experimenter in our
illustration, and the decision-maker or statistician plays the role of the
subject. Without wishing to give any anthropomorphic interpretations to the
physical universe, it is highly instructive to consider nature as
continuously tossing coins and performing other experiments with various
random devices. The order we observe in the chaos around us can be conceived
of as having the same structure as the order we observe in the sequence of
tosses of a coin, and what is unknown to us in the random mechanism employed
by nature is of the same structure as the bias in the coin put in by the
experimenter in our illustration. The consequences of our actions and
decisions depend on these unknowns basically in the same manner as the
expected income for the subject depends upon the value of ω chosen by the
experimenter. Finally, we can gain partial information as to the structure
of the random device employed by nature by observing outcomes from it and
basing our strategies upon these observations. But, unlike the case
considered, observations are usually costly, and a large part of decision
theory is devoted to the problem of how to balance the cost of observations
against the additional information gained from them.

Technical difficulties and space do not permit the formal development 
of the above ideas in their full generality. At best we can give a short out- 
line of some of the fundamental concepts involved and relate these to the 
illustration previously given. One of the fundamental concepts is that of an 
experiment. As is the case with the word strategy, the everyday meaning of
the word is too vague and ambiguous to be useful. When a layman thinks 
of an experiment, he may think of vials and test tubes if he has in mind 
a chemical experiment; of dials and electronic tubes if he thinks of an 
experiment in physics; or of a group of students memorizing nonsense syl- 
lables if he thinks of an experiment in education or psychology. However, 
in decision theory, an experiment is not what a scientist performs in a 
laboratory at any given time, but rather a set-up conceived in somewhat 
more abstract terms in which there are (a) a collection of possible ob- 
servables, (b) a collection of possible states of nature in the given situa- 
tion, and (c) for every state of nature in this collection, a probability 
distribution over the observables. The collection of observables is
technically known as the outcome space, and is denoted by Z with elements z.
The collection of possible states of nature is called the space of strategies
for nature and is denoted by Ω with elements ω. Then, for each state of
nature, the probability distribution on Z is denoted by p(z|ω). In most
statistical problems, ω plays the role of a parameter in the distribution,
and Ω is called the parameter space.

In terms of our previous example, the outcome space Z consists of all
possible sequences of heads and tails obtained by tossing a coin N times.
The space Ω is the interval of real numbers between zero and one from
which the experimenter selects a value ω as the bias for the coin. For any
ω, the probability distribution on Z is the binomial distribution. Thus, if
z is a sequence in which the first n tosses contain r heads and n−r tails
in a specified order,

\[
p(z \mid \omega) = \omega^{r}(1-\omega)^{n-r}.
\]
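
The distinction between the probability of one sequence in a specified order and the probability of a given number of heads can be made concrete. In the sketch below, which uses hypothetical function names, the second quantity is the first multiplied by the number of orderings.

```python
from itertools import product
from math import comb


def sequence_probability(z, w):
    """p(z|w) for a sequence z of 'H' and 'T' in a specified order."""
    r, n = z.count("H"), len(z)
    return w**r * (1.0 - w) ** (n - r)


def head_count_probability(r, n, w):
    # Binomial probability of r heads in n tosses, in any order.
    return comb(n, r) * w**r * (1.0 - w) ** (n - r)


print(sequence_probability("HHT", 0.6), sequence_probability("HTH", 0.6))
print(head_count_probability(2, 3, 0.6))
```

Because the sequence probability depends on z only through the number of heads, nothing is lost by basing decisions on that number alone.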


As another example, the space of outcomes Z might consist of all possible
time periods required by a subject to memorize 25 nonsense syllables. An
element ω in Ω might consist of a pair (μ, σ) where μ is the mean and
σ is the standard deviation in the distribution of these time intervals. The
form of the distribution is assumed to be normal, that is, p(z|ω) is a normal
density.

We note that a data-collecting procedure is not necessarily an experi- 
ment as we have defined it, unless the above ingredients can be delineated 
in some manner. What usually makes an experiment poorly designed is the 
lack of specification of what constitutes Z, Ω, and the class of probability
distributions that are being contemplated. 

In addition to the notion of experiment, two other notions, that of an 
action space and that of a loss function, are basic in any statistical
decision problem. An action space is simply a collection of all possible actions
or decisions available to the decision maker. The action space is usually 
denoted by A with elements a. Presumably, the problem of the decision 
maker is to choose an action from this collection, that is, an a from A. 
The specification of possible actions available to the decision maker is 
again less trivial than it appears. Thus, an investigator often may not 
realize that other courses of action are open to him, such as deferring 
judgment until further information is available, and because of this, he is 
forced into making an unwise decision. 

The concept of a loss function is somewhat more difficult. To begin with, 
decision theory puts the motivation of the decision maker in a some- 
what gloomy setting. Instead of evaluating each decision or action in terms 
of how much he can gain by it, he evaluates it in terms of how little he can 
lose by it compared with another decision or action. But whether we 
evaluate in terms of gain or loss, it is important at the outset to realize 
that this evaluation must depend not only on what he does but also on the 
actual state of nature. For example, the payoff to our subject in the illustra- 
tive example depended not only on what he bet, but also on the outcome 
of the toss which in turn depended on the ω selected by the experimenter.
For any a in A and any ω in Ω, the loss function is represented by
L(ω, a).

Here again we see that decision theory demands a great deal of the deci- 
sion maker. It demands that he be in a position to evaluate numerically 
for every possible state of nature in the situation under consideration the 
consequences of any of the actions that he might take. It has been argued 
by many that no human being possesses the ability so to evaluate the utility 
of the various actions in all possible states. While this is not the place 
to enter into an extensive discussion of this subject, we shall merely state 
that the inability of the decision maker to formulate clearly the loss func- 
tion is, in fact, a stumbling block in determining what a rational mode 
of behavior is for him. More bluntly, it is impossible to tell a person what 
is an optimal way for him to behave if he is unable to formulate clearly 
what he is after. However, the inability to define a specific loss function 
should not prevent us from studying the consequences of various hypo- 
thetical loss functions. Also, great progress has recently been made in ob- 
taining important results in decision theory which are independent of the 
particular form of the loss function. Finally, decision theory acts as a gadfly 
to the research worker. It says to him: You cannot solve your problem 
unless you more clearly define your goal and the consequences of your 
decisions. Such a prodding is likely to be healthy. In addition, it must be 
emphasized that the classical Neyman-Pearson way of evaluating decisions 
by the size of the probability of being wrong is still available, and is easily 
incorporated in this new framework for statistics. 

As we have noted, the outcome space Z, the parameter space Ω, the
probability distributions p(z|ω), the action space A, and the loss function
L(ω, a) constitute the elements of a statistical decision problem. A strategy
for the statistician in this situation is called a decision function. It is a 
rule which specifies what action a in A will be selected for every outcome 
z in Z. A decision function is usually designated by d and the class of all 
possible decision functions or rules by D. We have already seen examples 
of decision functions in the learning model problem. Note that for any z, 
d(z) is an element in A. 

For any decision function d and any ω in Ω, the loss is given by
L(ω, d(z)). The expected loss, averaged for all z with respect to the
probability distribution p(z|ω), is known as the risk and is designated by
ρ(ω, d). For a fixed d, ρ(ω, d) when plotted against ω yields a risk curve
which is not unlike the income curves we studied in the learning model
problem. We wish to emphasize here that a decision function is usually
selected without the knowledge of what ω is, since ω is at the control
of nature. Minimax decision functions, admissible decision functions, in- 
admissible decision functions, and complete classes of decision functions 
are defined in a manner parallel to that for the illustrative example. In 
addition to defining these classes of strategies, decision theory often offers 
constructive methods for generating them. 
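
For a finite outcome space the risk is a plain weighted sum, as the following sketch shows. The coin-betting illustration is an assumption made for the example: the loss is taken to be the shortfall of the chosen bet's expected payoff from the better of the two bets, with payoffs of 100 and 150 units, and all names are hypothetical.

```python
def risk(omega, d, outcomes, p, loss):
    """Expected loss of decision function d at state omega:
    the sum over outcomes z of loss(omega, d(z)) * p(z, omega)."""
    return sum(loss(omega, d(z)) * p(z, omega) for z in outcomes)


# Illustration with one observed toss of a coin with P(head) = omega.
outcomes = ("H", "T")


def p(z, omega):
    return omega if z == "H" else 1.0 - omega


def payoff(omega, a):
    return 100.0 * omega if a == "head" else 150.0 * (1.0 - omega)


def loss(omega, a):
    # Assumed loss: shortfall from the better of the two bets at this omega.
    return max(payoff(omega, "head"), payoff(omega, "tail")) - payoff(omega, a)


def follow(z):
    return "head" if z == "H" else "tail"


def always_tail(z):
    return "tail"


for omega in (0.2, 0.5, 0.6, 0.9):
    print(omega, round(risk(omega, follow, outcomes, p, loss), 2),
          round(risk(omega, always_tail, outcomes, p, loss), 2))
```

Plotting either column against ω gives a risk curve of the kind described above; neither rule is better for every ω.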

As has been mentioned, decision theory is in the process of development
and is not yet a completed science. It takes for its domain the problem
of rational behavior in the face of unknown states of nature. It gives a 
logical foundation and framework to mathematical statistics which it has 
never had before. It grapples with the problem of design of experi- 
ments in a manner never before attempted in statistics. It insists that cost 
considerations and consequences of decisions be taken into account in 
every statistical investigation. In doing so, it has brought sequential theory 
into the general framework and has made sequential decision procedures 
the rule rather than the exception. It bridges the gap existing in classical 
statistics between testing hypotheses and estimation by showing that the 
distinction lies merely in the structure of A, which in one case is finite 
and in the other infinite. In its attempt to clarify the nature of the sta- 
tistical decision process, it exposes the limitations as to what is attainable 
in the face of ignorance and raises serious problems concerning rational 
behavior. One such problem, for example, is how to select a specific strategy 
from a class of admissible strategies if no prior information is available. 
This problem becomes particularly pertinent if we question the wisdom of 
the minimax approach against a neutral opponent, such as nature is for the 
statistician. Finally, because decision theory deals with the problem of
decision making in its greatest generality, every particular statistical prob- 
lem can be immediately placed in a general framework and thus exhibit 
its ramifications and implications. 

To illustrate the last remark, let us return once more to the coin-tossing 
experiment. We observe that the subject is never testing any hypothesis, 
null or otherwise. Instead, he takes actions; namely, he makes bets. Simi- 
larly, in many research situations there is a need to change the emphasis 
from hypothesis testing to decision making. For example, suppose ω
represents the proportion of children who can learn certain arithmetical
concepts in a given period of time by a new teaching method. Then the
decision to recommend this method or not, based on results of an
experimental group, is not dissimilar to the decisions of the subject in
making bets on head or tail. Also, as in the case of the coin-tossing
experiment, the preference of the research worker might be to recommend the
new method for some values of ω, and not to recommend it for other values.
The important thing is that the region in which the true value of ω lies is
unknown, and
the decision has to be based on the observations. The only intrinsic dif- 
ference between the two problems is that while the subject had a precise 
evaluation of what it meant for him to be right or wrong, the research 
worker is not in this happy situation. This makes the problem more dif- 
ficult, but not, on that account, essentially different. 

In the coin-tossing experiment, the subject was forced to make decisions 
after every toss, and his decisions were based on an increasing amount 
of information. It is not difficult to visualize the educational experiment 
mentioned above being carried out over a period of years, with additional 
information being obtained each year. It is also conceivable that the
research worker might be willing or be asked to commit himself on a
tentative yearly recommendation, allowing the possibility that from year to
year he might reverse his decision. Now, in reference to the coin-tossing
experiment, we have indicated that there is a problem of dovetailing the
strategies from one toss to the next. That is, while for each sample size n
we can describe an admissible class of strategies for the subject, it is not
true in this case that an arbitrary selection from the class of admissible
strategies for each n results in an admissible grand strategy prescribing
the actions thruout the whole course of the N tosses. What constitutes an
admissible class of grand strategies for this as well as other cases has
recently been found, and might have immediate applications to the kind of
running experiment mentioned above.

We observe that the subject in the coin-tossing experiment at each stage
is making a prediction concerning the outcome of a random variable,
namely, the result of the toss of a coin. But this is a special case of the
general problem of prediction when formulated in decision theory language.
More precisely, in a prediction situation, the experimenter has obtained a
sample of observations from a distribution with an unknown ω and he is
concerned not with ω itself, but rather with the outcome of a future
observation from the distribution specified by it. The loss function
in this case will usually depend on the difference between the value of the
random variable which occurs and the predicted value. The general
prediction problem is now being analyzed from the decision theoretic point of
view and the final results will undoubtedly find applications in the field
of social science in general and education in particular.

One feature which the coin-tossing problem does not exhibit, but the 
more general problem of the two-armed bandit does, is that of design. 
In the two-armed bandit problem, with ω₁ + ω₂ not necessarily equal to unity,
the subject’s aim is to decide as soon as possible which arm has the larger 
probability of paying off. But he has several courses of action open to 
him in order to arrive at such a decision. He can pull the left arm for a 
specified period of time, or the right arm, or alternate between the right 
and left in some prescribed manner. What is the optimal design? Here once 
more we have a problem which is not mere idle curiosity. Very often in 
research, we have two alternative experiments available, each shedding 
light on a given problem, and we may want to know how many of each 
experiment to perform and in what order. Interest heightens in such prob- 
lems when the cost of performing the two experiments differs. 
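
How a design affects the chance of reaching the right conclusion can be computed exactly in simple cases. The sketch below is only an illustration and not an optimal design: it assumes a fixed number of pulls on each arm and a selection rule, invented for the example, that picks the arm with the higher observed rate of payoff and breaks ties with a fair coin.

```python
from math import comb


def pmf(k, n, p):
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def prob_correct(n_left, n_right, p_left, p_right):
    """Probability of selecting the truly better arm (p_left != p_right)
    after n_left pulls of the left arm and n_right pulls of the right arm."""
    better_left = p_left > p_right
    total = 0.0
    for i in range(n_left + 1):
        for j in range(n_right + 1):
            pr = pmf(i, n_left, p_left) * pmf(j, n_right, p_right)
            # Compare i/n_left with j/n_right without dividing.
            lhs, rhs = i * n_right, j * n_left
            if lhs == rhs:
                total += 0.5 * pr        # tie: fair coin
            elif (lhs > rhs) == better_left:
                total += pr              # observed rates point to the better arm
    return total


# Hypothetical payoff probabilities of .7 and .5; twenty pulls in all.
for design in ((10, 10), (15, 5), (18, 2)):
    print(design, round(prob_correct(design[0], design[1], 0.7, 0.5), 3))
```

When the two arms differ in cost, the same calculation can be repeated for designs of equal total cost rather than equal total number of pulls.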

To summarize, statistical decision theory is statistics looked upon from 
a broader and more unified point of view. To those who have mastered the 
concepts of decision theory, it gives a new way of looking at problems 
and furnishes new tools for attacking them. As yet there exists no general 
elementary treatise on statistical decision theory to which the reader can 
be referred. The only nontechnical discussion appears in a review by 
Savage (4) of Wald’s book, Statistical Decision Functions (5). Wald’s
book itself can be recommended to the mature mathematician. Blackwell 
and Girshick (1) provided a discussion of intermediate mathematical dif- 
ficulty and gave an extensive bibliography dealing with the technical 
aspects of decision theory as well as with game theory. The book Design 
for Decision by Bross (2) presents a philosophical discussion of the deci- 
sion problem, but does not consider statistical technics in the general 
framework of modern statistical decision theory. A review of the litera- 
ture on the economic theory of decision making and of psychological ex- 
periments relevant to it, together with a discussion of some of the concepts 
of game theory, was given by Edwards (3). 

Short discussions of nonparametric methods have appeared within
chapters of previous issues of the Review that dealt with research methods. 
Since this is the first treatment of nonparametric methods in a separate 
chapter, it seemed desirable to include references beyond the past three 
years when they were essential to the discussion. With a few exceptions 
these references were not included in the previous chapters of the 
Review. Emphasis, however, has been placed upon developments appearing 
within the past three years. 

Nonparametric or distribution-free statistical methods are those that 
make no assumptions about the nature of the distribution function in the 
population being sampled, other than in some instances that it is con- 
tinuous. In instances where the assumption of, or the transformation to, 
normality is not reasonable, nonparametric methods are appropriate. 
The advantages of these methods are: (a) no assumption is made as to 
the form of the parent distribution, except that in some cases it is assumed 
to be continuous, and (b) computation is generally simple and rapid. 
Since many of the variables studied in the behavioral sciences probably 
are not normally distributed, these methods are particularly useful in 
research in these fields. The greatest disadvantage of nonparametric 
methods is that they are not quite so efficient as parametric statistics 
when the distribution is normal or known. Some current confusions with 
respect to this point are discussed in a later section. 

The present chapter has two distinct divisions. The first part is con- 
cerned with developments most closely related to practical educational 
research work. It includes a count of applications of nonparametric 
methods in certain journals and a discussion of possible reasons for the 
infrequency of applications in education. A brief discussion of general 
summaries and bibliographies concludes the first part. The second part 
describes the present status of the groups of nonparametric methods thought 
likely to be of greatest usefulness in the behavioral sciences. 


Applications of Nonparametric Methods 


Twenty professional publications in the behavioral sciences thought 
most likely to report studies that used nonparametric methods were 
examined for the period January 1950 to July 1954. Every page of 
these journals was scanned for applications of nonparametric methods 
and 341 references were located.* This list is not exhaustive, but it is
believed to be representative of the kinds of studies that applied
nonparametric methods. After examining these applications, the general
impression of the authors is that nothing would be accomplished if these
references were discussed in detail or even listed in the bibliography.
There are no lessons to be learned or points to be made that cannot be made
better by reading some of the statistical publications, especially some of
those issued since 1953.

* This work was done by Glade T. Wilcox and Alfred C. Koester.

The education journals examined included the Journal of Educational 
Psychology, the Journal of Experimental Education, the Journal of Edu- 
cational Research, Educational and Psychological Measurement, and 
Harvard Educational Review. The psychology journals included the 
American Journal of Psychology, the British Journal of Psychology, the 
British Journal of Statistical Psychology, Genetic Psychology Mono- 
graphs, the Journal of Abnormal and Social Psychology, the Journal of 
Applied Psychology, the Journal of Consulting Psychology, the Journal 
of Experimental Psychology, the Journal of Genetic Psychology, the 
Journal of Social Psychology, Psychological Bulletin, Psychological Mono- 
graphs, and Psychometrika. From the field of sociology the American 
Sociological Review and Sociometry were examined. 

Table 4 summarizes the frequency of such articles. These are only crude 
enumeration data and any comparison of relative frequencies would be 
misleading as well as unjustified. The data are, however, sufficient to 
indicate that chi-square is the most widely used technic of nonparametric 
statistical inference. 


TABLE 4. FREQUENCY OF ARTICLES REPORTING USE OF A
NONPARAMETRIC TEST IN 20 PERIODICALS FOR THE
PERIOD JANUARY 1950 TO JULY 1954

                Number of      Number of articles reporting use of
Field           journals     Chi-square      Test        Estimation
                examined                  statistics     statistics
(1)               (2)           (3)           (4)            (5)

Education           5            23             2              5
Psychology         13           189            64             28
Sociology           2            26             1              3
Total              20           238            67             36
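
The entries of Table 4 can be checked against the totals quoted in the text. The sketch below simply re-adds the columns; the variable names are hypothetical.

```python
# Rows of Table 4: (journals examined, chi-square, test statistics, estimation statistics)
table_4 = {
    "Education": (5, 23, 2, 5),
    "Psychology": (13, 189, 64, 28),
    "Sociology": (2, 26, 1, 3),
}

# Column totals, in the same order as the row entries.
totals = tuple(sum(column) for column in zip(*table_4.values()))
articles = sum(totals[1:])  # all articles reporting a nonparametric method

print(totals, articles)
```

The column totals agree with the last row of the table, and the three article counts add to the 341 references mentioned above.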





The four journals that contained the largest number of nonparametric 
statistical applications, other than the chi-square test, were the Journal of 
Experimental Psychology, the Journal of Abnormal and Social Psychol- 
ogy, the American Journal of Psychology, and Psychological Bulletin. 
When all education journals were combined, they did not contain as many 
nonparametric applications as any single one of these. There are probably 
many problems in education which are not commonly studied that could 
profitably be attacked by nonparametric methods. There are also problems
in education that are studied by parametric methods when the assumptions 
of these methods do not hold. Perhaps some of the reasons for this blind 
spot with reference to nonparametric methods are to be found in an 
examination of the educational statistics textbooks. 


Nonparametric Methods in Educational Research 


Educational research workers cannot be expected to apply methods 
they do not know about. Before 1953 there were no educational statistics 
books that seriously concerned themselves with nonparametric statistics. 
Most of those dealing with the second course in statistics and some dealing 
with the first course carried some discussion of chi-square and Spearman’s 
rank correlation. Dixon and Massey (11) and Mood (54) dealt with 
nonparametric methods, but these texts are somewhat more mathematical 
than texts educational research workers interested in applications are 
likely to read. The applications and theory, of course, were available 
in statistical journals, but these journals are far removed from the main 
group of educational research workers. 

Since 1953 the picture has changed markedly. The publications of 
Festinger and Katz (14), Goodman (24), Moses (57), Mosteller (60), 
and Walker and Lev (93) have made explanations of nonparametric 
methods available to those with limited mathematical training. The non- 
mathematical language and arithmetic examples provided make the dis- 
cussions easy to follow and hence should cause few mathematical neuroses 
among educational research workers. The excellent discussions of the 
place of statistical inference in statistics and of mathematical models by 
both Festinger and Katz, and Walker and Lev should give the research 
worker a basis for sound application of the methods. It is hoped that in 
the next few years the nonparametric methods will become better known. 

Another reason for the infrequent use of nonparametric methods can 
be traced to the attachment of certain textbook writers to parametric 
methods, possibly because of their elegant mathematical properties of 
efficiency when the assumptions of normality obtain. Even as careful a 
textbook writer as Lindquist (44: 89-90), who has been justifiably 
critical of others for loose writing, allows his efficiency bias to show. In 
discussing nonparametric methods, he indicates that there are some data
which are not amenable to transformation and hence nonparametric 
methods may be used. He adds that “all of these tests are less powerful 
than those assuming normality and homogeneity of variance.” His parting 
advice is that the more powerful tests will continue indefinitely to be 
used in the majority of educational experiments, or wherever the assump- 
tions seem satisfied. 

In common with several other writers, Lindquist somewhat casually 
dismisses or overlooks several possibilities. Researchers often have no 
knowledge of the form of the distribution with which they are dealing, 
or they may have so little knowledge of the mathematical function describing 
the form of the distribution that they cannot properly apply a normalizing 
transformation. Until recent developments in nonparametric methods 
there was little that could be done in most such cases except to make 
unwarranted assumptions. Whether situations of non-normality are fre- 
quent or rare depends on the nature of the problems studied, the experi- 
mental procedures, and kinds of data collected. If the status quo is main- 
tained in research and the same problems are attacked in the same way 
by the same technics, then the situation with reference to non-normality 
will remain as now. However, recent studies of problem-solving, or of 
trouble-shooting behavior, indicate that normality and homogeneity of 
variance do not necessarily obtain. Study of these and other higher mental 
processes may require some use of nonparametric methods. 


Rationale of Nonparametric Methods 


Nonparametric methods are based on a simple property of order sta- 
tistics; namely, that if we order a set of observations on a continuous 
variable which has a probability density function, the joint distribution 
of areas under the density function between the various observations of 
the ordered set is independent of the form of the original density function. 
Since the normal distribution has a density function, nonparametric methods 
are applicable to a set of observations on a normally distributed variable. 
However, nonparametric methods here are less efficient than methods which 
make use of the additional information on the form of the distribution. 
Until recently, the efficiency of a nonparametric test was determined by 
comparing the power of the test with the corresponding power of the 
test that would be used if the variables were normally distributed. Mathematical 
statisticians are still not entirely clear as to how to evaluate 
efficiency in the case where the probability density is non-normal. If the 
distribution is non-normal, nonparametric tests are frequently more 
sensitive to differences than the corresponding test based on the faulty 
assumption of normality. An example from Festinger and Katz (14: 548) 
illustrates this point with hypothetical data. An attitude test was admin- 
istered to 22 subjects before and after the subjects studied race prejudices. 
The differences were 17 positive, 4 negative, and 1 zero. In these data 
the inappropriate (but efficient for normal samples) t-test yielded a 
nonsignificant result. The appropriate (but less efficient for normal 
samples) sign test rejected the null hypothesis at the 1-percent level. 
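
The sign test result in this example can be checked by direct computation. The following Python sketch assumes that the single zero difference is discarded and that a two-sided test is intended; neither detail is stated in the example, and the function name is ours.

```python
from math import comb


def sign_test_p_value(n_positive: int, n_negative: int) -> float:
    """Two-sided exact sign test p-value; zero differences are dropped beforehand."""
    n = n_positive + n_negative
    k = min(n_positive, n_negative)
    # Probability of k or fewer signs of the rarer kind when + and - are equally likely.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical data from the text: 17 positive, 4 negative, 1 zero (dropped).
p = sign_test_p_value(17, 4)
print(f"two-sided p = {p:.4f}")  # two-sided p = 0.0072
```

Under these assumptions the probability falls below .01, in agreement with rejection at the 1-percent level.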

The meaning of power in this case can be illustrated by assuming that 
the distribution was normal and the t-test was applicable. The sign test 
would then have required 35 pairs of observations to equal the power of 
the t-test for 22 pairs. However, had the Mann-Whitney U statistic been 
used, 23 pairs of observations would have done the same job as the 
t-test with 22 pairs. 
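
The sample sizes quoted here are consistent with dividing the 22 pairs by an asymptotic relative efficiency. The sketch below assumes efficiencies of 2/π for the sign test and 3/π for the Mann-Whitney test relative to the t-test under normality; that these values are the origin of the figures 35 and 23 is our inference and is not stated in the example.

```python
from math import pi

T_TEST_PAIRS = 22

# Assumed asymptotic relative efficiencies against the t-test for normal data.
ARE_SIGN = 2 / pi
ARE_MANN_WHITNEY = 3 / pi


def equivalent_sample_size(n_t_test: int, are: float) -> float:
    """Approximate number of cases giving the same power as the t-test with n_t_test cases."""
    return n_t_test / are


print(round(equivalent_sample_size(T_TEST_PAIRS, ARE_SIGN)))          # 35
print(round(equivalent_sample_size(T_TEST_PAIRS, ARE_MANN_WHITNEY)))  # 23
```

The efficiencies are large-sample quantities, so the results should be read as approximations for samples of this size.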

The point to be made is that instead of evaluating the efficiency of a 
nonparametric method in terms of the corresponding method which would 
be applicable if the variables were normally distributed, the researcher 
would do well to examine the appropriateness of the statistical model for 
the data and the problem. If the chance variable being observed is 
normally distributed, or if it can properly be transformed to a variable 
which is normally distributed, methods appropriate for the normal dis- 
tribution probably should be used. If these conditions do not obtain, or if 
considerations of cost of computations enter, the possibility of using 
nonparametric methods should be considered. 


General Summaries and Bibliographies 


The educational worker interested in well-written and easy-to-follow 
discussions of nonparametric methods should refer to Festinger and Katz 
(14), Goodman (24), Moses (57), Mosteller (60), and Walker and Lev 
(93). While the researcher may find a useful introduction to the more 
common nonparametric test and estimation statistics in these references, 
the vast bulk of references is still carried in mathematical statistics journals 
such as the Annals of Mathematical Statistics. 

The most comprehensive bibliography of nonparametric methods currently 
available was prepared by Savage (74), who listed 999 references 
under the following headings: 


Surveys and Discussions 
Theory 
Tchebycheff Inequalities 
Tolerance Sets 
Goodness of Fit 
Multisample Problems 
Parameter Problems 
Contingency Tables 
Randomness 
Correlation and Curve Fitting 
Comparative Studies 
Systematic Statistics 
Scaling 
Distribution Theory 
Applications 
Tables 
Miscellaneous 


Summaries have been prepared by Scheffé and Tukey (75), Wilks (99), 
and Wolfowitz (102). These summaries are written in mathematical lan- 
guage. Hoel’s (28) revised edition of his text includes a chapter on non- 
parametric methods. 


Review of Educational Research, Vol. XXIV, No. 5 





Tests of Goodness of Fit 


The tests considered in this section are applicable in problems in which 
the question at issue relates to the form of the probability distribution in 
the population from which a particular set of observed data has been 
drawn as a sample. The test that the population distribution is of a speci- 
fied form is often called a test of goodness of fit. This is a nonparametric 
problem in the sense that, if the probability distribution is not the one 
specified by the hypothesis, the various alternative distributions constitute 
a family which is so large that it cannot be described in functional form 
with one or more unknown parameters. 

Undoubtedly, the best known as well as the oldest among the nonpara- 
metric goodness of fit tests is the chi-square test, which was devised by 
Pearson (68). Because it is so well-known, we shall not describe the test 
in detail but instead refer the reader to the excellent article by Cochran 
(9), in which many of the uses and abuses of the test are described. The 
chi-square test possesses an advantage when compared to the test to be 
considered below. Suppose the population distribution is not completely 
specified but depends upon one or more unknown parameters. Then, if 
one uses the minimum chi-square method to estimate the parameters, the 
large sample theory of chi-square remains valid provided certain condi- 
tions are satisfied. For details, see Cochran’s article (9). 

In the remainder of this section it will be assumed that we are dealing 
with a continuous chance variable, and that the family of possible prob- 
ability distributions is such that the cumulative distribution functions are 
continuous. From this assumption, it follows that if G(x) is the probability 
that an observed value of the variable will be less than or equal to x, and 
we consider the transformation of any set of observed values X_1, X_2, ..., X_n 
to a set of values Y_1, Y_2, ..., Y_n by the relationship Y_i = G(X_i), the sample 
values Y_1, ..., Y_n are independently and uniformly distributed between 
0 and 1. The distribution theory of all tests considered here is based on 
this fact. 

For any set of observed data, the value of the sample cumulative distribution 
function F_n(x), corresponding to each possible value x of the 
chance variable under consideration, is defined as the proportion of the 
sample values which do not exceed x. A test devised by Kolmogorov (37) 
considers all possible discrepancies between values of the cumulative distribution 
function specified by the hypothesis and corresponding values 
of the sample cumulative distribution function. The test statistic D_n is 
the largest of these discrepancies in absolute value multiplied by the square 
root of n. The hypothesis is rejected when D_n is too large. In order to 
apply the test, the distribution of D_n is needed. Kolmogorov obtained a recursion 
formula which enables one to compute the distribution of D_n for any 
sample of size n. Tables of this distribution were computed for various 
values of n and several levels of significance by Birnbaum (5) and by 
Massey (50, 51). Kolmogorov also obtained an asymptotic formula which 
provides an approximation to the distribution of D_n for large values of n, 
and the approximation was tabulated by Smirnov (80). If the hypothesis 
is true, the distribution of the statistic is independent of the form of the 
specified probability distribution. Thus the test is nonparametric. 
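
The computation of the Kolmogorov statistic may be sketched in Python as follows. The function names are ours, the hypothesized cumulative distribution function is supplied by the user, and the tail probability uses the usual alternating series for the large-sample distribution, which is assumed here to be the asymptotic formula referred to above.

```python
from math import exp, sqrt
from typing import Callable, Sequence


def kolmogorov_statistic(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """sqrt(n) times the largest absolute gap between the sample and hypothesized CDFs."""
    xs = sorted(sample)
    n = len(xs)
    largest = 0.0
    for i, x in enumerate(xs, start=1):
        g = cdf(x)
        # The sample CDF jumps from (i - 1)/n to i/n at x, so both sides are checked.
        largest = max(largest, i / n - g, g - (i - 1) / n)
    return sqrt(n) * largest


def kolmogorov_tail(d: float, terms: int = 100) -> float:
    """Large-sample probability that the statistic exceeds d when the hypothesis is true."""
    if d <= 0:
        return 1.0
    total = sum((-1) ** (k - 1) * exp(-2 * k * k * d * d) for k in range(1, terms + 1))
    # Clamp to [0, 1]; for very small d the truncated series is unreliable.
    return min(1.0, max(0.0, 2 * total))


uniform_cdf = lambda x: min(1.0, max(0.0, x))
d = kolmogorov_statistic([0.1, 0.4, 0.7], uniform_cdf)
print(round(d, 4), round(kolmogorov_tail(d), 4))
```

For small samples the series is only an approximation, and exact tables of the kind cited above would be preferred.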

A generalization of D_n which enables the researcher to obtain tests sensitive 
to particular kinds of alternatives was considered by Anderson and 
Darling (1). However, the distribution theory involved is still in a highly 
complex stage, mathematically speaking, and no tables are available as yet. 

Another test based on the sample distribution function was considered 
by Smirnov (79). The test statistic here is called W_n^2 and is based on the 
sum of the squares of the discrepancy between a value of the sample cumulative 
distribution function and the corresponding value of the specified 
cumulative distribution function. The hypothesis that the probability distribution 
has the specified form is rejected when W_n^2 is too large. Smirnov 
showed that, if the hypothesis is true, the distribution of W_n^2 does not 
depend on the specified form of the cumulative distribution function, and 
he obtained the large sample distribution of W_n^2. A table of that distribution 
was given by Anderson and Darling (1), who also generalized W_n^2 
in a manner similar to their generalization of D_n. 
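
A squared-discrepancy statistic of this kind can be sketched as follows. The text gives no formula, so the sketch assumes the statistic is n times the integrated squared difference between the sample and specified cumulative distribution functions, which reduces to the computing form used in the code; the function name is ours.

```python
from typing import Callable, Sequence


def w_squared(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Computing form: 1/(12n) + sum over i of (G(x_(i)) - (2i - 1)/(2n))^2."""
    ys = sorted(cdf(x) for x in sample)   # transformed values in increasing order
    n = len(ys)
    return 1.0 / (12 * n) + sum(
        (y - (2 * i - 1) / (2 * n)) ** 2 for i, y in enumerate(ys, start=1)
    )


# Two observations sitting exactly at the mid-points of the two halves of (0, 1).
print(w_squared([0.25, 0.75], lambda x: x))
```

The smallest value the statistic can take is 1/(12n), attained when each transformed value falls at (2i - 1)/(2n).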

Consider again the transformation of the set of observed values X_1, ..., X_n 
to a set of values Y_1, ..., Y_n by the relation Y_i = G(X_i), where G(x) is the 
probability under the hypothesis tested that a value in the original set will 
be less than or equal to x. Suppose these transformed values are arranged 
in order of increasing magnitude. (It will be observed that the order of the 
observations is the same in both the original and the transformed set.) 
Then, if the hypothesis is true, the average value of the ith ordered value 
in the transformed set is i/(n+1), and consequently the average absolute 
value of the difference between two successive ordered values in this set 
is 1/(n+1). Thus, 
the hypothesis that the probability distribution is the one specified can be 
tested with the use of statistics which compare the observed and expected 
values of differences in the ordered set of transformed observations. Such 
tests were proposed by Kimball (36), Moran (56), and Sherman (76). 
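
A statistic of this general type can be sketched as follows. The exact forms used by Kimball, Moran, and Sherman are not given in the text, so the particular sum below (half the total absolute deviation of the n + 1 spacings from their common expected value) is an illustrative assumption, and the function name is ours.

```python
from typing import Callable, Sequence


def spacing_discrepancy(sample: Sequence[float], cdf: Callable[[float], float]) -> float:
    """Half the summed absolute deviation of the n + 1 spacings from 1/(n + 1)."""
    ys = sorted(cdf(x) for x in sample)
    n = len(ys)
    points = [0.0] + ys + [1.0]           # add the end points of the unit interval
    expected = 1.0 / (n + 1)              # expected spacing when the hypothesis is true
    return 0.5 * sum(abs(b - a - expected) for a, b in zip(points, points[1:]))


identity = lambda x: x
print(spacing_discrepancy([0.25, 0.5, 0.75], identity))  # evenly spaced values
print(spacing_discrepancy([0.5, 0.5, 0.5], identity))    # badly clustered values
```

Evenly spaced transformed values give a discrepancy of zero, and clustering increases it.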

In many situations the researcher has some idea about alternative hypo- 
theses. In such situations it is sometimes possible to devise a test which is 
more sensitive to particular alternatives than the kind of “general purpose” 
test described above. An example is the “smooth” test proposed by Neyman 
(63). For an illuminating paper discussing these ideas see Birnbaum (4). 


Comparison of Two Samples 


The tests considered in this section are applicable in problems in which 
two samples are to be compared in order to test the hypothesis that the 
populations from which they were drawn have the same distribution. A 
great deal of work has been done on this problem. 

An early and easily applied test for the two-sample problem was given 
by Thompson (84). The test statistic T_{m,n} is defined as the number of 
values in the first sample which do not exceed the kth value in the second, 
where k is an integer between 1 and n inclusive. One advantage of the test 
is that the distribution of T_{m,n} is easily derived in terms of any given pair 
of probability distributions. Thus, the power of the test can be computed 
(in theory at least) for any pair of alternatives. The general idea of Thompson’s 
test is that when the hypothesis is true and the two samples come 
from the same population, the distribution of the number of values in the 
first sample which are less than the kth value in the second, or which lie 
between any two of the ordered values in the second, is independent of the 
common probability distribution. Tests based on the same idea were sug- 
gested by Dixon (10) and by Mathisen (52). 

A similar idea was employed in a test given by Westenberg (95). Sup- 
pose the two sets of values are arranged in a single ordered sequence. The 
test statistic T_{m,n} is now the number of values in the first sample which 
do not exceed the kth value in the combined set. One choice for k would 
be such that the corresponding value is the median of the combined sample. 

A test which has been quite widely used in the last few years was 
originally suggested by Wilcoxon (97), and in slightly different forms 
by Festinger (13), and by Mann and Whitney (46). Again arrange the 
two sets of values into a single ordered sequence. The test statistic T_{m,n} 
is then defined as the number of pairs which can be formed by taking 
each value in the first set and pairing it in turn with each value in the 
second set which exceeds it. Assuming the truth of the hypothesis, Mann 
and Whitney (46) derived the distribution theory. They presented recursion 
formulas by means of which the distribution of T_{m,n} can be calculated 
for any pair of integers m and n, and they also derived an asymptotic formula 
for the large sample distribution of T_{m,n}. A table of significance levels 
of the distribution was included. This table was extended by Auble (3). 
Marshall (47) presented a large sample test for the hypothesis that one of 
two random variables is stochastically larger than the other. 
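
The pair-counting statistic described in this paragraph can be sketched as follows. The sketch assumes no tied observations, and it obtains the distribution under the hypothesis by enumerating all assignments of ranks to the first set rather than by the recursion formulas of Mann and Whitney; the function names are ours.

```python
from itertools import combinations
from typing import Sequence


def pair_count(first: Sequence[float], second: Sequence[float]) -> int:
    """Number of (x, y) pairs, x from the first set and y from the second, with y > x."""
    return sum(1 for x in first for y in second if y > x)


def exact_lower_tail(m: int, n: int, observed: int) -> float:
    """P(statistic <= observed) when every choice of m ranks for the first set is equally likely."""
    ranks = range(m + n)
    total = 0
    hits = 0
    for first_ranks in combinations(ranks, m):
        chosen = set(first_ranks)
        second_ranks = [r for r in ranks if r not in chosen]
        count = sum(1 for x in first_ranks for y in second_ranks if y > x)
        total += 1
        hits += count <= observed
    return hits / total


print(pair_count([1, 2, 3], [4, 5, 6]))  # 9
print(exact_lower_tail(3, 3, 0))         # 0.05
```

Enumeration is practical only for small m and n; for larger samples the tables or the asymptotic formula mentioned above would be used.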

Another well-known test for the two-sample problem is the run test due 
to Wald and Wolfowitz (91). Let both sets of observations be arranged 
in a single ordered sequence. A run of values from the first set is then an 
uninterrupted subsequence consisting of values from the first set only, and 
a run of values from the second set is similarly defined. The test statistic 
T_{m,n} is then defined as the totality of runs, with small values of 
T_{m,n} considered significant. The idea here is that we would expect the ordered 
sequence to produce a reasonably good intermixing of values from 
the two sets when the hypothesis is true. Wald and Wolfowitz obtained 
the small sample as well as the asymptotic distribution of T_{m,n} when the 
hypothesis is true. Tables of significant values for various values of m and 
n were computed by Swed and Eisenhart (83). 
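
The run count and its small-sample distribution can be sketched as follows. The probability formula is the standard combinatorial result for arrangements of m and n labels and is assumed here to agree with the distribution obtained by Wald and Wolfowitz; the function names are ours.

```python
from math import comb
from typing import Sequence


def count_runs(labels: Sequence[int]) -> int:
    """Total number of runs in a sequence of set labels (for example 1 and 2)."""
    if not labels:
        return 0
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def runs_probability(m: int, n: int, r: int) -> float:
    """P(exactly r runs) when all arrangements of m and n labels are equally likely."""
    if r < 2:
        return 0.0
    k, odd = divmod(r, 2)
    if not odd:
        ways = 2 * comb(m - 1, k - 1) * comb(n - 1, k - 1)
    else:
        ways = comb(m - 1, k) * comb(n - 1, k - 1) + comb(m - 1, k - 1) * comb(n - 1, k)
    return ways / comb(m + n, m)


def runs_lower_tail(m: int, n: int, observed: int) -> float:
    """P(number of runs <= observed); small values are the significant ones."""
    return sum(runs_probability(m, n, r) for r in range(2, observed + 1))


# Labels of the ordered combined sample: 1 = first set, 2 = second set.
print(count_runs([1, 1, 2, 2, 1]))   # 3
print(runs_lower_tail(3, 4, 2))      # 2 of the 35 arrangements have only two runs
```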

Suppose values of sample cumulative distribution functions are determined 
for each of the two sets of sample values, and consider all possible 
discrepancies between corresponding values of these two functions. Smirnov 
(78) investigated the large sample distribution of the statistic T_{m,n}, where 
T_{m,n} is the largest of these discrepancies in absolute value multiplied by 
the square root of mn/(m+n). A method for computing the distribution of T_{m,n} 
for m and n small and a table of significant values for various sample sizes 
was given by Massey (48, 49). 
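
The two-sample statistic just described can be computed directly from its definition. The following is a minimal sketch; the function name is ours, and the two sample cumulative distribution functions are compared at every observed value, which is where the largest discrepancy must occur.

```python
from math import sqrt
from typing import Sequence


def smirnov_two_sample(first: Sequence[float], second: Sequence[float]) -> float:
    """sqrt(mn/(m+n)) times the largest gap between the two sample CDFs."""
    m, n = len(first), len(second)
    largest = 0.0
    for x in sorted(set(first) | set(second)):
        f = sum(1 for v in first if v <= x) / m    # sample CDF of the first set at x
        g = sum(1 for v in second if v <= x) / n   # sample CDF of the second set at x
        largest = max(largest, abs(f - g))
    return sqrt(m * n / (m + n)) * largest


print(smirnov_two_sample([1, 2, 3], [4, 5, 6]))  # completely separated samples
print(smirnov_two_sample([1, 2, 3], [1, 2, 3]))  # identical samples give 0.0
```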

A method of obtaining nonparametric tests based on permutations of 
the observations was first suggested by Fisher (15) and was explored in 
a very general way by Pitman (70). The idea here is that, if we are given 
the values of the two combined samples without knowing which of the 
observations were from the first set and which were from the second set, 
any choice of m of them as a possible first set is as likely as any other, 
provided the hypothesis holds. Consequently if T_{m,n} is any function of 
m+n variables which is symmetric separately in its first m variables and 
its last n variables, the different values that T_{m,n} can assume by choosing m 
of the sample values as a possible first set and the other n as a possible 
second set are all equally likely. In particular, Pitman considered all possible 
subdivisions of the set of combined observations into two sets of size 
m and n respectively, and proposed as a test statistic the difference in the 
mean value of the two sets. The computation involved in applying Pitman’s 
test becomes quite tedious, even for moderate values of m and n, and hence 
asymptotic formulas are desirable. Such formulas were provided by Wald 
and Wolfowitz (92) and generalized by Noether (65). 
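
The permutation argument can be sketched by direct enumeration. The sketch below assumes a two-sided test on the absolute difference in means; the function name is ours, and the enumeration is feasible only for small m and n, which is the tedium referred to above.

```python
from itertools import combinations
from typing import Sequence


def permutation_p_value(first: Sequence[float], second: Sequence[float]) -> float:
    """Two-sided exact p-value for the difference in means over all subdivisions."""
    pooled = list(first) + list(second)
    m, n = len(first), len(second)
    grand_total = sum(pooled)
    observed = abs(sum(first) / m - sum(second) / n)
    hits = 0
    total = 0
    for idx in combinations(range(m + n), m):
        s = sum(pooled[i] for i in idx)              # total of a possible first set
        diff = abs(s / m - (grand_total - s) / n)
        total += 1
        hits += diff >= observed - 1e-12             # small tolerance for rounding error
    return hits / total


print(permutation_p_value([1, 2, 3], [4, 5, 6]))  # 2 of the 20 subdivisions are as extreme
```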

Other tests are available for special situations. Suppose m=n, and fur- 
thermore, suppose the observations occur in pairs. Then one of the easiest 
tests to apply is the sign test which simply counts the number of pairs in 
which the observation from the first set exceeds the corresponding observa- 
tion from the second. A detailed discussion of the sign test was given by 
Dixon and Mood (12). 

In conclusion two additional facts should be mentioned. One concerns 
the hypothesis of symmetry. Suppose we have a sample of n observations 
and we wish to test the hypothesis that the probability distribution of values 
in the population is symmetric about zero. This is equivalent to testing the 
hypothesis that the observed positive values and the absolute values of the 
observed negative values are two samples from a population with the same 
probability distribution, and any of the tests discussed above can be applied. 

Some of the two-sample procedures discussed here can be generalized 
to test the hypothesis of equality of more than two distributions. Such is 
the case with the run test; the necessary distribution theory was given by 
Mood (53). A generalization of the Mann-Whitney test which applies to 
three distributions was given by Whitney (96). Other multi-sample tests 
are discussed in the following section. 


Analysis of Variance Tests 


In discussing nonparametric analysis of variance tests, we consider two 
possible problems: (a) We are given k sets of observed values and we want 
to test the hypothesis that they are all from the same parent population. 
This is analogous to the classical one-criterion analysis of variance where 
it is further assumed that the common distribution is normal. (b) We are 
given a set of observations which are subject to a two-way classification. 
We want to test the hypothesis that the first of these classifications, say, has 
no effect; that is, the set of all values constituting a single group under 
the second classification are from the same population. 

Two tests applicable to the first problem will be described. The first 
is the H test given by Kruskal and Wallis (39). Let N be the total number 
of observations in the k samples. All the observations are combined in a 
single ordered sample. Each observation is then replaced by its rank or 
position in the combined sample, the smallest of all observations obtaining 
the rank 1. Assuming there are no tied observations, H is defined as a 
function of the square of the sum, for each separate set, of ranks in the 
combined set divided by the number of observations in the set and summed 
over the k sets. Kruskal and Wallis gave a detailed discussion of the H 
test, including a prescription for defining H in case ties are present. Kruskal 
(38) proved that when the number of observations in each of the k sets 
is reasonably large, the distribution of H is approximately chi-square with 
k - 1 degrees of freedom, provided the hypothesis is true. Tables were included 
for the case k = 3, when each set has five observations, and some 
other methods for approximating the distribution of H under the null 
hypothesis were discussed. 
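
The verbal definition of H can be made concrete as follows. The sketch uses the customary form H = 12/(N(N+1)) times the sum of R_j^2/n_j, minus 3(N+1), where R_j is the rank sum and n_j the size of the jth set; this explicit formula is not spelled out in the text and is assumed here to be the function intended. Only the untied case is covered, and the function name is ours.

```python
from typing import Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """H statistic for k samples, assuming no tied observations."""
    pooled = sorted(v for g in groups for v in g)
    if len(set(pooled)) != len(pooled):
        raise ValueError("ties present; this sketch covers the untied case only")
    rank = {v: i for i, v in enumerate(pooled, start=1)}   # smallest value gets rank 1
    big_n = len(pooled)
    # Square of each set's rank sum, divided by the set size, summed over the k sets.
    weighted = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12.0 / (big_n * (big_n + 1)) * weighted - 3 * (big_n + 1)


print(round(kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), 4))  # 7.2
```

With sets this small the chi-square approximation would not be relied upon; the example serves only to show the arithmetic.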

In a test devised by Mood and Brown and reported by Mood (54), 
all samples are combined and the median of the combined group is found; 
then a 2 x k contingency table is set up in which is recorded for each of the 
k samples the number of individuals exceeding the median of the combined 
samples and the number not exceeding that median. It follows from 
well-known results on contingency tables that the hypothesis under consideration 
can be tested by a statistic whose distribution when the hypothesis 
is true is approximately chi-square with k - 1 degrees of freedom 
if the number of observations in each of the k sets is moderately large. 
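
The construction of the 2 x k table can be sketched as follows. The statistic computed is the ordinary chi-square for a contingency table, which is assumed here to be the statistic the authors have in mind; the function name is ours.

```python
from statistics import median
from typing import List, Sequence, Tuple


def median_test(groups: Sequence[Sequence[float]]) -> Tuple[List[List[int]], float]:
    """Return the 2 x k table (above / not above the pooled median) and its chi-square."""
    pooled = [v for g in groups for v in g]
    med = median(pooled)
    above = [sum(1 for v in g if v > med) for g in groups]
    not_above = [len(g) - a for g, a in zip(groups, above)]
    table = [above, not_above]
    total = len(pooled)
    chi_square = 0.0
    for row in table:
        row_total = sum(row)
        for count, g in zip(row, groups):
            expected = row_total * len(g) / total   # expected count under the hypothesis
            if expected > 0:
                chi_square += (count - expected) ** 2 / expected
    return table, chi_square


table, chi_square = median_test([[1, 2, 3, 4], [5, 6, 7, 8]])
print(table, chi_square)  # [[0, 4], [4, 0]] 8.0
```

The result would be referred to the chi-square distribution with k - 1 degrees of freedom only when each set is moderately large, as the text cautions.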

In connection with the second problem, Friedman (23) proposed a test 
based on the ranks of the values observed for each group under the second 
classification. Brown and Mood (7) devised a class of tests based on 
medians which can be applied to analysis of variance problems. 


Correlation Methods 


The earliest nonparametric correlation coefficient is the familiar rank 
order coefficient proposed by Spearman (81). This coefficient is the ordi- 
nary correlation coefficient with the observations replaced by their ranks. 
Kendall (32) introduced another measure of rank correlation, and he (33) 
also collected a large number of results on various methods of rank correlation, 
including tau, W, partial tau, and measures of consistency and of agreement 
based upon paired comparison data. 
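
The two rank coefficients can be sketched as follows. Spearman's coefficient is computed, as described above, as the ordinary correlation of the ranks; the second function counts concordant and discordant pairs, which is assumed here to be Kendall's measure. No ties are allowed for, and the function names are ours.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Ranks 1..n, assuming no tied values."""
    order = {v: i for i, v in enumerate(sorted(values), start=1)}
    return [order[v] for v in values]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Ordinary correlation with the observations replaced by their ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """(concordant pairs - discordant pairs) divided by the number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


x, y = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
print(round(spearman(x, y), 4), round(kendall_tau(x, y), 4))  # 0.8 0.6
```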

The hypothesis of randomness is the hypothesis that the observations in 
a sample are independently and identically distributed. Three methods have 
been used to test the hypothesis of randomness. One is based on rank 
correlation coefficients, and the basic reasoning here is that under the 
hypothesis of randomness, the ranks of the observations are uncorrelated 
with the integers 1, 2, ..., n. Mann (45) investigated the Kendall rank 
correlation coefficient as a test for randomness. 

A second method of testing for randomness is based on serial correlation, 
which is the correlation between an element of the sample and its successor. 
Two such tests were given, one by Wald and Wolfowitz (90) and one by 
Young (104). 

There is a series of tests based on runs. Two types of runs may be considered, 
runs above and below the median, and runs up and down. In the 
first case each observation is replaced by a 1 or a 0 according to whether 
the observation is above or below the median of the sample and the test 
is based on the properties of the runs of 1’s and 0’s. An example of such 
a test is that of Mosteller (58) who based his test on the longest run of 1’s. 

In the second case each pair of observations (X_i, X_{i+1}) is replaced by a 
1 or a 0 according to whether X_i < X_{i+1} or X_i > X_{i+1}, and the test is again 
based on the runs of 1’s and 0’s. Such tests were considered by Fisher (16) 
and Kermack and McKendrick (34). Much of the distribution theory was 
worked out by Levene and Wolfowitz (43), Olmstead (66), and Wolfo- 
witz (101). In a recent paper Levene (42) considered a number of such 
tests, with emphasis on the problem of how to choose between them. 
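
The two ways of reducing a series to 1's and 0's can be sketched as follows. The sketch assumes that no observation equals the sample median and that no two successive observations are equal; the function names and the short series are ours and purely illustrative.

```python
from statistics import median
from typing import List, Sequence


def run_lengths(symbols: Sequence[int]) -> List[int]:
    """Lengths of the successive runs of identical symbols."""
    lengths: List[int] = []
    for i, s in enumerate(symbols):
        if i > 0 and s == symbols[i - 1]:
            lengths[-1] += 1
        else:
            lengths.append(1)
    return lengths


def above_below_median(series: Sequence[float]) -> List[int]:
    """1 if the observation is above the sample median, 0 if below."""
    med = median(series)
    return [1 if v > med else 0 for v in series]


def up_down(series: Sequence[float]) -> List[int]:
    """1 if the next observation is larger than the present one, 0 if smaller."""
    return [1 if b > a else 0 for a, b in zip(series, series[1:])]


series = [1, 3, 2, 4, 6, 5]
print(len(run_lengths(above_below_median(series))))  # 2 runs above and below the median
print(len(run_lengths(up_down(series))))             # 4 runs up and down
```

A statistic such as the length of the longest run of 1's is then the largest entry of `run_lengths` taken over the runs of 1's.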

Let (X_1, Y_1), ..., (X_n, Y_n) be a sample from a bivariate distribution. The 
hypothesis to be tested states that the two variables X and Y are independent. 
The earliest tests of independence were based on correlation coefficients. 
Hotelling and Pabst (29) considered Spearman’s rank coefficient; 
Kendall (32) considered his rank coefficient; and Pitman (71) 
considered a test based on the ordinary correlation coefficient. An interesting 
test of the hypothesis that two variables are independent, which is 
somewhat too complex to describe here, was proposed by Olmstead and 
Tukey (67), who also derived the necessary distribution theory. Other 
tests based on ranks were proposed by Hoeffding (26) and Lehmann (40), 
and certain desirable properties of these tests were demonstrated. 


Properties of Nonparametric Tests 


In the preceding sections we have described a number of tests designed 
to test various hypotheses. The one feature common to all of them is that 
the exact or approximate distribution of the test statistic could be computed, 
provided the hypothesis was true. This was necessary in order that the test 
could be applied. But this does not solve the problem of how to choose 
between several tests designed to test the same hypothesis, a problem which 
is solved in many parametric cases by the Neyman-Pearson theory. Because 
of certain mathematical properties, it is easy to construct a number of 
different tests of the same size, provided the underlying distribution functions 
are assumed to be continuous. What properties, then, should we 
expect a good test to possess? While there is no unified practical theory 
available as yet, we shall discuss several properties which seem reasonable, 
as well as a method of comparing two tests, provided the sample sizes are 
large. 

The first of these properties is the property of consistency. A test (or 
more properly a sequence of tests) is said to be consistent for a given al- 
ternative hypothesis if for large samples the probability of rejecting the null 
hypothesis is close to unity when the alternative hypothesis is true. It has 
been suggested by Wolfowitz (100) and Lehmann (40) that the least one 
should expect from a good test is that it be consistent against the particular 
class of alternatives for which it was designed. A number of tests men- 
tioned earlier are consistent against large classes of alternatives, for ex- 
ample, the Wald-Wolfowitz run test, the Mann-Whitney test, the Kolmo- 
gorov goodness of fit test. 

Another reasonable property that we might expect a good test to have is 
that it be unbiased for the class of alternatives which is of interest. A test 
is defined as unbiased for a given alternative if the probability of rejecting 
the null hypothesis exceeds the level of significance when the alternative 
hypothesis is true. A method for showing unbiasedness for a large class of 
tests was given by Lehmann (40). This remarkable paper included several 
other results: a method for proving consistency of certain tests; a theorem 
based on a result of Hoeffding (25) for obtaining the asymptotic distribution 
of a number of test statistics when some alternate hypothesis is true; 
and tests for the two-sample problem and the hypothesis of independence. 
Both these tests have the property of being consistent and unbiased against 
all alternatives. 

Pitman (69) developed the concept of asymptotic relative efficiency 
of one consistent test with respect to another. Roughly speaking, the asymp- 
totic relative efficiency of one test with respect to another is the reciprocal 
of the limiting ratio of the sample sizes necessary to achieve a fixed prob- 
ability of rejecting the hypothesis when an alternative which in a certain 
sense is close to the hypothesis actually is true. The precise definitions 
are too complex to give here, but it may be pointed out that this concept 
seems to be particularly useful in comparing nonparametric tests with the 
optimum parametric tests for normally distributed variables, although the 
concept has been used in other situations. 

Some rather surprising results have been obtained. For example, Mood 
(55) showed that in the two-sample problem when the distributions 
involved are assumed to be normal, the asymptotic relative efficiency of 
the median test compared to the t-test is 2/π and that of the Mann-Whitney 
test compared to the t-test is 3/π or about .95. This seems to 
indicate that the Mann-Whitney test compares quite favorably with the 
t-test, even in those situations when the t-test is known to be optimal. 
Similar results were obtained by Andrews (2) for the Kruskal-Wallis H 
test and the Mood-Brown test based on medians, when compared with 
the classical analysis of variance test for a one-way criterion. Asymptotic 
relative efficiencies of various tests of the hypothesis of randomness 
were given by Noether (64) and in a recent paper by Stuart (82). 

This section closes with two remarks. At the beginning of this section 
it was stated that there is not as yet a general method which yields optimum 
nonparametric tests for a given hypothesis and a given class of alterna- 
tives. However, two papers have made a beginning in the solution of this 
difficult problem. One is by Wolfowitz (100), in which a modified likeli- 
hood ratio procedure is defined which yields optimum rank tests under 
certain restrictions. The other is a paper by Lehmann (41), in which 
nonparametric classes of alternatives are defined and optimum tests against 
these alternatives are derived by using a theorem of Hoeffding (27). Some 
of the standard tests were shown to have optimum properties against 
certain classes of alternatives. 

One of the basic assumptions underlying almost all nonparametric tests 
is the assumption of continuity of the cumulative distribution functions 
involved. In practice this is a very serious restriction, since in many 
experimental situations tied observations will occur. Some of the papers 
discussed offer recipes for breaking ties without much theoretical justi- 
fication for these recipes. A general method for breaking ties with solutions 
for some of the distributional problems involved was given by Putter (72). 


Nonparametric Discrimination 


The problem of assigning an individual to one of two populations on 
the basis of k measurements recorded for him is customarily called dis- 
crimination. If the joint probability distribution of these k measurements 
in each of the two populations is known, the problem has a relatively 
simple optimum solution, based on the Neyman-Pearson likelihood ratio 
procedure; this problem was solved by Welch (94). 

In many applications, it has been assumed that the probability distribu- 
tions in the two populations are multivariate normal with unknown means 
and equal but unknown variance-covariance matrices. It is further 
assumed that there are available a sample of m observations known to have 
come from the first population and a sample of n observations known 
to have come from the second. Fisher (17) introduced a procedure, known 
as the linear discriminant function, which proceeds as follows: The 
unknown parameters (mean vectors and variance-covariance matrices) 
are estimated on the basis of the observations; the estimated parameters 
determine the two distributions completely, and the likelihood ratio procedure 
is employed. The procedure appears to be quite reasonable, provided 
the assumptions of normality and equal variance-covariance matrices 
are not too badly violated. 

However, in many problems it is clear that these assumptions cannot 
be justified, and there is need for procedures which are free from assump- 
tions about the functional forms of the distributions. Such a class of 
procedures was discussed in two papers by Fix and Hodges (18, 19). The 
idea of the Fix-Hodges procedure is relatively simple. An integer j is 
chosen which is small compared with the two sample sizes. After the 
measurements for the individual to be classified are recorded, the j obser- 
vations in the combined sample which are nearest to the new observation 
(in some sense) are found. Let M be the number of these which belong to 
the sample from the first population and N = j - M the number which belong 
to the sample from the second. The ratios M/m and N/n are then used 
to estimate the corresponding probability densities for each of the two 
populations, and the likelihood ratio procedure is used on these estimates. 
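
The procedure can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the authors' own formulation: "nearest" is taken to mean ordinary Euclidean distance, the likelihood ratio is compared with one (equal priors and costs), ties are resolved in favor of the first population, and the function name is hypothetical.

```python
import math

def fix_hodges_classify(x, sample1, sample2, j):
    """Classify x into population 1 or 2 from its j nearest neighbours."""
    m, n = len(sample1), len(sample2)
    # Pool the two samples, keeping the population label of each point.
    pooled = [(math.dist(x, p), 1) for p in sample1]
    pooled += [(math.dist(x, p), 2) for p in sample2]
    pooled.sort(key=lambda pair: pair[0])
    nearest = pooled[:j]
    M = sum(1 for _, label in nearest if label == 1)
    N = j - M
    # M/m and N/n are proportional to the estimated densities at x,
    # so comparing them is the likelihood ratio rule with cut-off one.
    return 1 if M / m >= N / n else 2
```

Because the same neighbourhood is used for both populations, the volume of the neighbourhood cancels in the ratio, which is why only M/m and N/n need to be compared.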

Fix and Hodges (18) showed that this procedure is almost as good 
as the likelihood ratio procedure for the case in which the two probability 
densities are known, provided the two sample sizes are large and some 
other weak assumptions hold. They (19) also compared some of the 
nonparametric procedures with the linear discriminant function, when 
the sample sizes are relatively small and when the assumptions of nor- 
mality and equal variance-covariance matrices hold. 


Nonparametric Estimation 


Consider the following problem. We have a sample of observations on a 
chance variable whose associated probability distribution is assumed to 
belong to a given class but is otherwise unknown. We would like to char- 
acterize this distribution by means of a descriptive number, or a series 
of descriptive numbers, which are defined for each distribution in the 
class; for example, such numbers might be the first k moments of the 
distribution or the median or other quantiles. Since the actual distribution 
is unknown, we must estimate these numbers with the use of the sample. 

In the traditional problems of estimation, the class of distributions to 
which the unknown distribution belongs is assumed to have a given func- 
tional form depending on one or more parameters. Here the basic problem 
has been the problem of estimating the parameters since any descriptive 
numbers for the distribution are functions of these parameters, and, in 
general, can be estimated by the corresponding functions of the estimates. 
However, the problems of point and interval estimation and the deter- 
mination of tolerance limits have recently been considered in the more 
general framework of nonparametric inference, that is, the class of distribu- 
tions to which the unknown distribution belongs is not restricted to a 
given parametric family. The results so far obtained are considered here. 

A point estimator is a function of the sample values which gives us a
unique number as an estimate of the descriptive number in which we are
interested. A point estimator (or, more precisely, a sequence of point
estimators) is said to be consistent provided that the probability that the
estimated value differs from the true value by more than any given amount
converges to zero as the sample size n becomes large.
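
In symbols, and with notation introduced here only for illustration (T_n for the estimator computed from n observations and \theta for the descriptive number being estimated), the definition of consistency may be written as follows.

```latex
\lim_{n \to \infty} \Pr\bigl( \lvert T_n - \theta \rvert > \varepsilon \bigr) = 0
\quad \text{for every } \varepsilon > 0 .
```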

For any set of n observations, we define the sample cumulative distribu- 
tion function as before. It is intuitively evident that a reasonable estimate 
for a descriptive number defined for the population cumulative distribution 
function is the corresponding number determined for the sample cumu- 
lative distribution function, with necessary modifications when the number 
determined for the sample is not unique. For example, if we wish to esti- 
mate the population mean, it is clear that the sample mean is a reasonable 
estimate. From well-known theorems in probability theory, it follows that 
in many cases these estimators are consistent and are also maximum like- 
lihood estimators. 

A method of estimation based only on functions of the order statistics 
was considered by Mosteller (59), who showed that in certain special 
cases these estimators are almost as good as the “best” possible estimators, 
when “goodness” is measured in terms of variance. 

The method of point estimation has its obvious drawback, for rarely, 
if ever, is our estimate going to be correct, and it may be quite far from 
the correct value with large probability. The method of estimation based 
on confidence intervals was devised by Neyman (62) and proceeds as 
follows: A number, α, known as the confidence coefficient, is chosen
with 0 < α < 1. The problem is to find a pair of functions of the sample,
constituting a “random” interval, having the property that the probability
that the random interval covers the value to be estimated is at least α.

A method of finding confidence intervals for medians and other quantiles
when the chance variable is continuous and has a continuous cumulative
distribution function is based on a fact observed previously: in this
case, if F(x) is the probability that an observed value of the variable
will be less than or equal to x, and we transform any set of observed
values X_1, X_2, ..., X_n to a set of values Y_1, Y_2, ..., Y_n by the
relation Y_i = F(X_i), then the sample values Y_1, Y_2, ..., Y_n are
independently and uniformly distributed between 0 and 1. Thompson (85) first
noted that this fact could be used to obtain confidence intervals. The
method in the case of medians will be described briefly. Suppose the set
of sample values is ordered according to increasing magnitude. Then the
probability that the kth ordered value does not exceed the unknown
median and simultaneously the (n - k + 1)st ordered value is not smaller
than the median is independent of the probability distribution in the
population. Hence we may choose the largest k for which this probability
is not smaller than the confidence coefficient. Our confidence interval is
then the interval from the kth ordered value to the (n - k + 1)st ordered
value. We clearly want k as large as possible, for in that case the interval
is as short as possible, which is desirable. A table listing the largest
possible k, for n = 6, ..., 81, and confidence coefficients .95 and .99,
was compiled by Nair (61), who also gave large sample formulas appropriate
for n larger than 81. A similar method, based on two samples, can be used
to obtain confidence intervals for the difference of two unknown medians.
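
The choice of k can be computed directly rather than read from a table. The following Python sketch is an illustration, assuming a continuous distribution so that the number of observations falling below the median is binomial with parameters n and 1/2; the function names are hypothetical.

```python
from math import comb

def coverage(n, k):
    """Probability that the k-th and (n-k+1)-st ordered values bracket the median."""
    # The interval misses the median only if fewer than k observations
    # fall below it, or fewer than k fall above it.
    lower_tail = sum(comb(n, i) for i in range(k))
    return 1.0 - 2.0 * lower_tail / 2 ** n

def largest_k(n, conf):
    """Largest k whose coverage is at least conf (0 < conf < 1); 0 if none."""
    k = 0
    while coverage(n, k + 1) >= conf:
        k += 1
    return k

def median_interval(sample, conf):
    """Distribution-free confidence interval for the population median."""
    xs = sorted(sample)
    n = len(xs)
    k = largest_k(n, conf)
    if k == 0:
        raise ValueError("sample too small for this confidence coefficient")
    return xs[k - 1], xs[n - k]
```

A returned k of zero means that even the interval from the smallest to the largest observation falls short of the required coefficient, which is the case for samples of fewer than six observations at a coefficient of .95.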

Frequently it is desirable to obtain a picture of the entire cumulative 
distribution function in the population. One method of doing this is 
to obtain, on the basis of the sample, a “random belt” in the plane, 
having the property of covering the true distribution with probability 
at least equal to the confidence coefficient. A general solution to this prob- 
lem was given by Wald and Wolfowitz (89). A detailed description of the 
method can be found in a paper by Wolfowitz (102). 
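
To convey the idea of such a belt, the sketch below builds a band of constant half-width around the sample cumulative distribution function. It should be noted plainly that this is not the Wald-Wolfowitz construction cited above: it substitutes the Dvoretzky-Kiefer-Wolfowitz inequality, which bounds the probability that the sample function departs from the true one by more than eps anywhere by 2 exp(-2 n eps^2). The function name is hypothetical, and distinct sample values are assumed.

```python
import math

def dkw_band(sample, conf):
    """Return (eps, band); band lists (x, lower, upper) at each sample point."""
    n = len(sample)
    # Half-width chosen so that 2 * exp(-2 * n * eps**2) = 1 - conf.
    eps = math.sqrt(math.log(2.0 / (1.0 - conf)) / (2.0 * n))
    band = []
    for i, x in enumerate(sorted(sample), start=1):
        f_hat = i / n  # sample cumulative distribution function at x
        band.append((x, max(0.0, f_hat - eps), min(1.0, f_hat + eps)))
    return eps, band
```

The band is clipped to the interval from 0 to 1, since a cumulative distribution function cannot leave that range.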

The problem of tolerance limits was first posed by Shewhart (77) as 
a problem arising in industrial quality control, but it also arises in many 
areas of application. If the probability distribution in the population is 
known, it is easy to find two numbers such that the proportion of the 
distribution between these two numbers is at least a given amount. Suppose, 
however, that the population distribution is unknown, and the only source of 
information is a set of sample values. Can we determine two numbers 
from the sample such that with probability greater than some preassigned 
value, the proportion of the distribution between these two random num- 
bers is at least the given amount? Further, we should want the distribu- 
tion of this proportion to be independent of the probability distribution 
in the population when this distribution belongs to some large class of 
probability distributions. The problem was solved by Wilks (98) for the 
class of one-dimensional distributions possessing continuous derivatives 
and the solution was extended to the class of all one-dimensional
distributions by Scheffé and Tukey (75). Wald (88) solved the corresponding
problem for continuous multidimensional distributions, and further
extensions in the multivariate case were given by Fraser (20, 21), Fraser
and Wormleighton (22), and Tukey (86, 87).
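
In the simplest case, where the two numbers are the smallest and largest sample values and the distribution is continuous, the proportion of the population lying between them follows a beta distribution with parameters n - 1 and 2. The probability that this proportion is at least p is therefore 1 - n p^(n-1) + (n - 1) p^n. The Python sketch below, offered as an illustration with hypothetical function names, uses this expression to find the smallest sample size meeting a stated requirement.

```python
def prob_coverage_at_least(n, p):
    """P(proportion between the smallest and largest of n values >= p)."""
    return 1.0 - n * p ** (n - 1) + (n - 1) * p ** n

def smallest_sample_size(p, gamma):
    """Smallest n for which the sample extremes cover at least a
    proportion p of the population with probability at least gamma.
    Assumes 0 < p < 1 and 0 < gamma < 1."""
    n = 2
    while prob_coverage_at_least(n, p) < gamma:
        n += 1
    return n
```

The expression does not involve the form of the population distribution, which is the sense in which the limits are distribution-free.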


Stochastic Approximation Methods 


The stochastic approximation method of estimation is a new one which 
promises to have interesting applications in various fields. It has been 
explored in the case of regression problems and in the case of the expected 
value of response to a certain experiment at a specified level. 

In the usual regression analysis, it is assumed that the expected value 
of a chance variable Y depends upon a known variable X and certain 
unknown parameters that we want to estimate on the basis of a sample. 
Frequently we are interested in linear regression functions, and the method 
of least squares has been used to yield estimates of the unknown regression 
coefficients. Instead of trying to estimate the parameters of a linear
function, we may try to estimate the value of the variable X at which
the regression function has a given value, making no assumptions
about the form of the function.

The expected value of a response to a certain experiment at a specified 
level can be similarly evaluated. It is assumed that this value is a function 
of the level, but the form of the function is unknown to the experimenter. 
It is desired to find the level corresponding to a given expected value of 
the response. Robbins and Monro (73) proposed a method for making 
successive experiments in such a way that the level at which the experi- 
ments are carried out approaches the desired level in probability. Their 
results were generalized by Wolfowitz (103) and further generalized by 
Blum (6) and Kallianpur (30). Similar results were obtained by Kiefer 
and Wolfowitz (35) who devised a method to estimate the level for which 
the expected value of the response is a maximum. 
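
A minimal sketch of a scheme of the Robbins-Monro type is given below in Python. It rests on assumptions not stated in the text above: the expected response is taken to increase with the level, and the step at the nth experiment is c/n, a customary choice whose sum diverges while the sum of its squares converges. The names respond, target, and c are hypothetical.

```python
import random

def robbins_monro(respond, target, x0, steps, c=1.0):
    """Seek the level at which the expected response equals target.

    respond(x) returns one (possibly noisy) observed response at level x;
    the expected response is assumed to increase with x.
    """
    x = x0
    for n in range(1, steps + 1):
        y = respond(x)                   # one experiment at the current level
        x = x - (c / n) * (y - target)   # move against the observed excess
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    # Illustrative response: expected value 2x + 1, with normal noise.
    noisy = lambda x: 2.0 * x + 1.0 + rng.gauss(0.0, 0.5)
    print(robbins_monro(noisy, target=5.0, x0=0.0, steps=5000))
```

Each experiment is used once and then discarded; no regression function is fitted at any stage, which is what distinguishes the method from the least squares approach described earlier.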

Much research concerning these methods is going on at present. In
particular, mention should be made of a paper by Chung (8), which
obtains the asymptotic distribution of the levels at which
experimentation is successively carried out.

Pirie, Duncan A. S., Head, Exact Sciences Department, Detroit Public Schools,
4628 Devonshire Road, Detroit 24, Michigan.

Pitkin, Fred E., Research Director, Massachusetts Teachers Association, 14 Beacon 
Street, Boston 8, Massachusetts. 

Plaumlee, Lynnette B., Director, Test Development Department, Educational Testing 
Service, Box 592, Princeton, New Jersey. 

Polley, John W., Associate Professor, Teachers College, Columbia University, New 
York 27, New York. 

Polster, Arthur H., Assistant Superintendent, Sacramento City Unified School
District, P. O. Box 2271, Sacramento 10, California.

Potter, Mary A. (Retired), 1533 College Avenue, Racine, Wisconsin (formerly
Consultant in Mathematics, Board of Education, City Hall, Racine, Wisconsin).

Potter, Muriel C. (Mrs. Harry Langman), Associate Professor of Education,
Michigan State Normal College, Pierce Hall, Ypsilanti, Michigan.

Potthoff, Edward F., Director, Bureau of Institutional Research, University of 
Illinois, 1114 West Green Street, Urbana, Illinois. 

Pounds, Ralph L., Professor of Education, Teachers College, University of Cincin- 
nati, Cincinnati, Ohio. 


— Jackson O., Dean, College of Education, University of Wichita, Wichita 14,
Kansas.


Powell, Marvin, Assistant Professor of Education, Division of Education, Western 
Reserve University, Cleveland 6, Ohio. 

Prescott, George A., Test Editor, Division of Test Research and Service, World 
Book Company, 313 Park Hill Avenue, Yonkers 5, New York. 

Pressey, Sidney L., Professor of Psychology, Ohio State University, Columbus, Ohio. 

Preston, Ralph C., Professor of Education, University of Pennsylvania, Philadelphia, 
Pennsylvania. 

Price, Robert Diddams, Associate Professor of Education, Teachers College, Uni- 
versity of Cincinnati, Cincinnati 21, Ohio. 

Pugmire, D. Ross, Professor of Education, University of Oklahoma, Norman, 
Oklahoma. 

Purdy, Ralph D., Associate Professor, School Administration, Marshall College, 
Box 25, Huntington, West Virginia. 

Quattlebaum, Charles A., Principal Specialist in Education, Legislative Reference 
Service, The Library of Congress, Washington 25, D. C. 

Quigley, Eileen E., Chairman, Department of Home Economics, Southern Illinois 
University, Carbondale, Illinois. 


Rabinowitz, William, Research Associate, Bank Street College of Education, 69 
Bank Street, New York 14, New York. 


Rankin, Paul T., Assistant Superintendent of Schools, Detroit, Michigan. (Presi- 
dent of AERA, 1933-34.) 

Rasmussen, Elmer M., Dean, Dana College, Blair, Nebraska. 

Raymond, Dorothy, Colby College, Waterville, Maine. 

Read, John G., Professor of Education, The Science Education Center, School of 
Education, Boston University, 332 Bay State Road, Boston 16, Massachusetts. 

Reals, Willis H., Dean, University College, Washington University, St. Louis 5, 
Missouri. 

Reavis, William C., Professor Emeritus of Education, University of Chicago, Chicago, 
Illinois. 

Redd, George N., Dean, College of Higher Studies, Fisk University, Nashville 8, 
Tennessee. 

Reed, Homer B., Professor of Psychology, Fort Hays Kansas State College, Hays, 
Kansas. 

Reeves, Floyd W., Consultant to the President, Michigan State College, Administra- 
tion Building, East Lansing, Michigan. 

Reid, Howard T., Associate Professor of Personnel and Guidance, Brigham Young 
University, Provo, Utah. 


Reid, Jackson B., Assistant Professor of Educational Psychology, University of 
Texas, Sutton Hall 311, Austin, Texas. 

Rein, William C., Special Assistant for Operations, Office of Assistant Administrator 
for Vocational Rehabilitation and Education, Veterans Administration, Munitions 
Building, Washington, D. C. 

Reiner, William B., Research Assistant, Bureau of Administrative and Budgetary Re- 
search, Board of Education of the City of New York, 110 Livingston Street, 
Brooklyn 1, New York. 

Reinhardt, Emma, Head, Department of Education and Psychology, Eastern Illinois 
State College, Charleston, Illinois. 

Reitz, William, Professor of Educational Evaluation, Statistics, and Research, Ex- 
aminer, College of Education, Wayne University, Detroit 1, Michigan. 

Remmers, H. H., Director, Division of Educational Reference, Purdue University,
Lafayette, Indiana. (President of AERA, 1954-55.)

Remmlein, Mrs. Madaline Kinter, Assistant Director, Research Division, National
Education Association, 1201 Sixteenth Street, N. W., Washington 6, D. C.

Reusser, Walter C., Dean, Division of Adult Education and Community Service and 
Professor of Educational Administration, University of Wyoming, Laramie, Wyoming. 

Rhum, Gordon J., Associate Professor of Education, Iowa State Teachers College,
Cedar Falls, Iowa.

— aera H., Editor, The Nation’s Schools, 919 North Michigan Avenue, Chicago,
Illinois.

Richardson, H. D., Vicepresident, Arizona State College, Tempe, Arizona. 

— Herman G., Professor of Education, University of Chicago, Chicago 37,
Illinois.

Richter, Charles O., Assistant Superintendent, Newton Public Schools, Newton- 
ville 60, Massachusetts. 

Rinsland, Henry D., Professor of Education, University of Oklahoma, Norman, 
Oklahoma. 

Rivlin, Harry N., Chairman, Department of Education and Director of Graduate 
Studies, Queens College, Flushing, New York. 

— Irving, Assistant Professor of Education, Queens College, Flushing, New
York.

Robertson, M. S., Research Professor of Education and Associate Director, Bureau 
of Educational Research, College of Education, Louisiana State University and 
A and M College, Baton Rouge 3, Louisiana. 

Robinson, Helen M., Associate Professor, Department of Education, University of 
Chicago, Chicago, Illinois. 

Roca, Pablo, Director, Technical Division, State Department of Education, Hato Rey, 
Puerto Rico. 

Rodriguez-Bou, Ismael, Permanent Secretary, Superior Educational Council, Uni- 
versity of Puerto Rico, Rio Piedras, Puerto Rico. 

Rose, Ella J., Professor of Home Economics Education, School of Home Economics, 
University of Minnesota, St. Paul 1, Minnesota. 

Ross, Donald H., Associate Professor, Teachers College, Columbia University, 
New York 27, New York. 

Ross, Maurice James, Education Consultant, Connecticut State Department of Edu- 
cation, Box 2219, Hartford, Connecticut. 

Rothney, John W. M., Professor of Education, University of Wisconsin, Madison 6, 
Wisconsin. 

Rugen, Mabel E., Professor of Health Education, School of Education and School 
of Public Health, University of Michigan, Ann Arbor, Michigan. 

Rugg, Earle U., Chairman, Division of Education, Colorado State College of Edu- 
cation, Greeley, Colorado. 

Rulon, Phillip J., Professor of Education, Harvard Graduate School of Education, 
Cambridge, Massachusetts. 

Rummel, J. Francis, Assistant Professor of Education, University of Oregon, Eugene, 
Oregon. 

Rundquist, Richard M., Assistant Professor of Education, Guidance Bureau, Uni- 
versity of Kansas, Lawrence, Kansas. 

Russell, David H., Professor of Education, University of California, Berkeley, 
California. 

Russell, James E., Associate Professor of Education, Teachers College, Columbia 
University, New York 27, New York. 

Russell, John Dale, Chancellor and Executive Secretary, Board of Educational 
Finance, State of New Mexico, P. O. Box 1616, Santa Fe, New Mexico. 

*Russell, William F., Deputy Director for Technical Services, Foreign Operations 
Administration, Executive Office Building, Room 297, Washington, D. C. 

Rutherford, Jean M., Instructor, School of Education, Slocum Hall, Syracuse Uni- 
versity, Syracuse 10, New York. 

Ryan, W. Carson, Kenan Professor of Education, University of North Carolina, 
Chapel Hill, North Carolina. 

Ryans, David G., Professor of Educational Research, University of California, Los
Angeles 24, California.

Ryden, Einar R., Associate Professor of Psychology, Purdue University, Lafayette,
Indiana.


Safier, Daniel E., Research Director, Webster Publishing Company, 1808 Washington 
Avenue, St. Louis 3, Missouri. 

Salten, David G., Superintendent of Schools, Long Beach, New York. 

Sangren, Paul V., President, Western Michigan College, Kalamazoo, Michigan. 

Sargent, Cyril Garbutt, Professor and Director, Center for Field Studies, Harvard 
University, Spaulding House, 20 Oxford Street, Cambridge 38, Massachusetts. 

Sawin, Enoch I., Educational Specialist, Air University, AF ROTC Headquarters, 
81 Commerce Street, Montgomery 4, Alabama. 

Saylor, J. Galen, Chairman, Department of Secondary Education, University of 
Nebraska, 317 Teachers College, Lincoln 8, Nebraska. 

Scarborough, Barron B., Director, Bureau of Testing and Research, DePauw Uni- 
versity, 2 Harrison Hall, Greencastle, Indiana. 

Scates, Mrs. Alice Yeomans, Research Assistant, Division of Higher Education,
Office of Education, U. S. Department of Health, Education, and Welfare,
Washington 25, D. C.

Scates, Douglas E., Research Specialist, American Social Hygiene Association, 1790
Broadway, New York 19, New York. (President of AERA, 1947-48; Chairman,
Editorial Board, Review of Educational Research, 1937-43.)

Schaefer, Robert Joseph, Chairman, Department of Education, Washington Uni- 
versity, St. Louis, Missouri. 

Schaerer, Robert W., Instructor and Research Assistant, Institute of Educational 
Research, Indiana University, Bloomington, Indiana. 

Schmid, John, Jr., Research and Evaluation Psychologist, Air Force Personnel and 
Training Research Center, Lackland Air Force Base, San Antonio, Texas. 

Schonell, Fred J., Professor of Education, University of Queensland, Brisbane, 
Queensland, Australia. 

Schott, Andrew F., Lecturer, Marquette University, Milwaukee, Wisconsin. 

Schultz, Raymond E., Associate Professor of Education, Florida State University, 
Tallahassee, Florida. 


Schwartz, Anthony N., Associate Professor, Teachers College, New York University, 
Plattsburg, New York. 

Schweiker, Robert F., Research Associate, American Institute for Research, 410 
Amberson Avenue, Pittsburgh 32, Pennsylvania. 

Schwertman, John B., Director, Center for the Study of Liberal Education for 
Adults, 940 East 58th Street, Chicago 37, Illinois. 

Scott, C. Winfield, Director of Student Personnel Services, New Haven State 
Teachers College, 501 Crescent Street, New Haven 11, Connecticut. 

Scott, Helen E., Associate Professor, Rhode Island College of Education, Providence, 
Rhode Island. 


Scott, William Owen Nixon, Assistant Professor, College of Education, University
of Georgia, Athens, Georgia. 


+Sears, Jesse B., 40 Tevis Place, Palo Alto, California. (Professor Emeritus, Stanford 
University, Stanford University, California.)

Seay, Maurice F., Director, Division of Education, The W. K. Kellogg Foundation, 
250 Champion Street, Battle Creek, Michigan. 

See, Harold W., Associate Professor of Education, Teachers College, University of 
Cincinnati, Cincinnati, Ohio. 

Segel, David, Specialist for Pupil Personnel, Office of Education, U. S. Department of 
Health, Education, and Welfare, Washington 25, D. C. (Secretary-Treasurer of 
AERA, 1943-46.) 

Segner, Esther F., Head, Graduate Homemaking Education Department, Mississippi 
State College, Box 475, State College, Mississippi. 

Sells, Saul B., Head, Department of Clinical Psychology, United States Air Force
School of Aviation Medicine, Randolph Air Force Base, Randolph Field, San
Antonio, Texas.


Shamberger, Marvin, Director of Research, Missouri State Teachers Association,
Columbia, Missouri.

Shane, Harold G., Professor of Education, School of Education, Northwestern
University, Evanston, Illinois.


Shaw, Jack, Director of Student Personnel, Colorado State College of Education, 
Greeley, Colorado. 

Shay, Carleton B., Director of Testing, Santa Monica High School, 7th and Pico, 
Santa Monica, California. 

Shaycoft, Marion F., Coordinator of Technical and Research Advisory Services, 
American Institute for Research, 410 Amberson Avenue, Pittsburgh 32, Pennsylvania. 

Sten. James T., Director of Research, Independent School District, San Antonio,
Texas.

Sheats, Paul H., Associate Director, University Extension, University of California, 
Los Angeles, California. 

Shevenell, Rev. Raymond H., OMI, Director, Institute of Psychology, University of 
Ottawa, Ontario, Canada. 

Shienbloom, Charles, Special Assistant to the Director of Educational Research,
Board of Public Education, Philadelphia, Pennsylvania. 

Shores, J. Harlan, Professor of Education, University of Illinois, Urbana, Illinois. 

Shreve, John W., Director, Department of Research, Statistics, and Information, 
Cincinnati Public Schools, Cincinnati 2, Ohio. 

Silverman, Hirsch Lazaar, Director of Psychological Services and
Psychologist-in-Charge, Nutley Public Schools, Junior High School Building,
Nutley 10, New Jersey.

Silvey, Herbert M., Director, Bureau of Research and Examination Services, Iowa
State Teachers College, Cedar Falls, Iowa. 

Simpson, Alfred D., Professor of Education, Graduate School of Education, Harvard 
University, Spaulding House, 20 Oxford Street, Cambridge 38, Massachusetts. (Vice- 
president of AERA, 1932-33.) 

Sims, Verner M., Professor of Psychology, Bureau of Educational Research, College 
of Education, University of Alabama, University, Alabama. 

Singleton, Carlton M., Principal, Countryside School, Newton Public Schools, 191 
Dedham Street, Newton Highlands 61, Massachusetts. 

Singleton, Gordon G., Professor of Education, Baylor University, 2104 Gorman, 
Waco, Texas. 

Skard, Aase Gruda, Associate Professor, Psychological Institute, University of Oslo, 
Karl Johans gt 47, Oslo, Norway. 

Skogsberg, Alfred H., Principal, Bloomfield Junior High School, 177 Franklin Street, 
Bloomfield, New Jersey. 

Sloane, Frank O., Director of Research, The Board of Public Instruction of Dade 
County, 275 N. W. Second Street, Miami 36, Florida. 

Smallenburg, Harry W., Director, Division of Research and Guidance, Los Angeles 
County Schools, Los Angeles, California. 

Smith, Alexander F., Instructor in Education, Murkland Hall, University of New 
Hampshire, Durham, New Hampshire. 

Smith, Allan B., Bureau of Educational Research and Service, University of Con- 
necticut, Storrs, Connecticut. 

Smith, B. Othanel, Professor of Education, University of Illinois, Urbana, Illinois. 

Smith, Donald E. P., Chief, Division of Reading Improvement Services, Bureau of 
Psychological Services, University of Michigan, Ann Arbor, Michigan. 

Smith, Dora V., Professor of Education, University of Minnesota, Minneapolis, 
Minnesota. 

Smith, Henry P., Professor of Education, School of Education, University of Kansas, 
Lawrence, Kansas. 

Smith, Herbert A., Director, Bureau of Educational Research and Service, University 
of Kansas, Lawrence, Kansas. 

Smith, Linda C., Professor, Education Department, State Teachers College, Cortland, 
New York. 

Sochor, E. Elona, Acting Director, The Reading Clinic, Department of Psychology, 
Temple University, Philadelphia 22, Pennsylvania. 

Solomon, Herbert, Associate Professor, Teachers College, New York 27, New York. 

Soper, Wayne W., Chief, Bureau of Statistical Services, State Education Department, 
Albany, New York. 

Spaney, Emma, Assistant Professor, Department of Psychology, Queens College of
the City of New York, Flushing, New York.

Spaulding, Geraldine, Test Program Supervisor, Educational Records Bureau, 21
Audubon Avenue, New York 32, New York.

Spaulding, Seth J., Fundamental Education Program Specialist, Pan American 
Union, Washington 6, D. C. 

Spayde, Paul E., Director of Research and Guidance, The Public Schools, 1470 
Warren Road, Lakewood 7, Ohio. 

Spence, Ralph B., Professor of Education, Teachers College, Columbia University, 
New York 27, New York. 

Spencer, Peter L., Professor of Education, Claremont College, Claremont, California. 

Spickler, Emily A., 503 West 121st Street, New York 27, New York. 

Spitzer, Herbert F., Professor of Education, State University of Iowa, East Hall, 
Iowa City, Iowa. 

Staiger, Ralph C., Director, The Reading Clinic, Mississippi Southern College, Hat- 
tiesburg, Mississippi. 

Staines, Robert G., 22 Weedon Road, Artarmon, Sydney, Australia. 

Stalnaker, John M., Director of Studies, Association of American Medical Colleges, 
185 North Wabash Avenue, Chicago 1, Illinois. 

Standlee, Lloyd S., Research Associate, Institute of Educational Research, Indiana 
University, Rogers K, Bloomington, Indiana. 

Stanley, Julian C., Associate Professor of Education, University of Wisconsin, 
Madison 6, Wisconsin. 

Staton, Wesley M., Associate Professor of Physical Education and Health, College 
of Physical Education and Health, University of Florida, Gainesville, Florida. 

Stauffer, Russell G., Director, The Reading Clinic, Professor of Education, University 
of Delaware, Newark, Delaware. 

Stecklein, John E., Assistant Director, Bureau of Institutional Research, University 
of Minnesota, 211 Burton Hall, Minneapolis 14, Minnesota.

Stegeman, William H., Director of Research, San Diego City Schools, Education 
Center, Park Boulevard and El Cajon, San Diego 5, California. 

+Stern, Bessie C. (Retired), 4013 Maine Avenue, Baltimore 7, Maryland (formerly 
Director of Finance, Statistics, and Educational Measurements, State Department 
of Education, Baltimore, Maryland). 

Sterner, William S., Assistant Professor of Education, School of Education, Rutgers 
University, New Brunswick, New Jersey. 

§Stevens, Glenn Z., Associate Professor of Agricultural Education, The Pennsylvania 
State University, Department of Agricultural Education, State College, Pennsylvania. 

Stewart, Lawrence H., Assistant Professor of Education, School of Education, Uni- 
versity of California, Berkeley 4, California. 

Stewart, Mrs. Naomi, Staff Associate, Educational Testing Service, 20 Nassau Street, 
Princeton, New Jersey. 

Stewart, Robert C., Assistant State Superintendent, Secondary Schools, State
Department of Public Instruction, Dover, Delaware.

Stickler, W. Hugh, Director, Educational Research and Service, Florida State Uni- 
versity, Tallahassee, Florida. 

Stinnett, T. M., Executive Secretary, National Commission on Teacher Education 
and Professional Standards, National Education Association, 1201 Sixteenth Street, 
N. W., Washington 6, D. C. 

Stoke, Stuart M., Professor of Psychology and Education, Mount Holyoke College, 
South Hadley, Massachusetts. 

Stoughton, Robert W., Counselor Trainer, Bureau of Youth Services, State Depart- 
ment of Education, Hartford, Connecticut. 

§Stovall, Franklin L., Director, Division of Counseling and Testing, University of 
Houston, Houston 3, Texas. 

Strang, Ruth M., Professor of Education, Teachers College, Columbia University, 
New York 27, New York. 

Stratemeyer, Florence B., Professor of Education, Teachers College, Columbia Uni- 
versity, New York 27, New York. 


Strayer, George D., Jr., Professor of Educational Administration, College of
Education, University of Washington, Seattle 5, Washington.

Strevell, Wallace H., Chairman, Department of Educational Administration, College
of Education, University of Houston, Houston, Texas.

ry J. B., Professor of Education and Psychology, State University of Iowa, Iowa
City, Iowa.

*Studebaker, J. W., Vicepresident and Chairman, Editorial Board, Scholastic Maga- 
zines, 33 West 42nd Street, New York, New York. 

Sueltz, Ben A., Professor of Mathematics, State University Teachers College, Cort- 
land, New York. 

Super, Donald E., Professor of Education, Teachers College, Columbia University, 
New York 27, New York. 

Sutherland, Margaret, Assistant Professor, Department of Education, University of 
California, Davis, California. 

Swann, Reginald L., Associate Professor of Psychology and Education, Teachers 
College of Connecticut, New Britain, Connecticut. 

Swenson, Esther J., Professor of Elementary Education, University of Alabama,
University, Alabama. 

Symonds, Percival M., Professor of Education, Teachers College, Columbia Uni- 
versity, New York, New York. 

Taba, Hilda, Professor of Education, San Francisco State College, 1600 Holloway, 
San Francisco, California. 

Tait, Arthur T., Director of Research, California Test Bureau, 5916 Hollywood 
Boulevard, Los Angeles 28, California. 

Tasch, Ruth J., Resident in Clinical Psychology, Department of Psychological 
Services, State Hospital, Box 476, Jamestown, North Dakota. 

Tate, Merle W., Associate Professor, School of Education, University of Pennsylvania, 
Philadelphia 4, Pennsylvania. 

Tatum, Beulah Benton, Assistant Professor of Education, The Johns Hopkins Uni- 
versity, Baltimore 18, Maryland. 

— Hazel E., Director of Testing, East Carolina College, Greenville, North 
Carolina. 

*Terman, Lewis M., Professor Emeritus of Psychology, Stanford University, Stan- 
ford, California. 

Terry, Paul W., Professor and Chairman of the Department of Educational Psy- 
chology, College of Education, University of Alabama, University, Alabama. 

Theisen, W. W., Assistant Superintendent of Schools, Milwaukee, Wisconsin. (Presi- 
dent of AERA, 1922-23.) 

Thibadeau, Charles R., Superintendent of Schools, Belmont 78, Massachusetts. 

Thiede, Wilson B., Director of Field Services, University of Wisconsin, Room 313, 
Extension Building, Madison 6, Wisconsin. 

Thomas, Maurice J., Professor of Education, University of Pittsburgh, 2528 Cathe- 
dral of Learning, Pittsburgh 13, Pennsylvania. 

Thompson, Anton, Supervisor of Research, Long Beach Public Schools, 715 Locust 
Avenue, Long Beach 13, California. 

Thompson, George G., Professor of Psychology, Syracuse University, Syracuse, New 
York. 

Thorndike, Robert L., Professor of Education, Teachers College, Columbia Uni- 
versity, New York 27, New York. 

Thorp, Mary T., Director, Henry Barnard School, Rhode Island College of Educa- 
tion, Providence, Rhode Island. 

Thurstone, Thelma Gwinn, Professor of Education, University of North Carolina, 
Chapel Hill, North Carolina. 

Tiedeman, David V., Lecturer on Education, Harvard University, 13 Kirkland Street, 
Cambridge 38, Massachusetts. 

Tiegs, Ernest W., Professor of Education, Los Angeles State College, 230 South 
Mansfield Avenue, Los Angeles 36, California. 

Tinkelman, Sherman, Supervisor of Test Development, New York State Education 
Department, Albany 1, New York. 

Todd, Mrs. Vivian Edmiston, Box 8035, Long Beach 8, California. 


Toops, Herbert A., Professor of Psychology, Ohio State University, Columbus 10, 
Ohio. 
Topetzes, Nick John, Assistant Professor, Department of Education, Marquette Uni- 
versity, Milwaukee, Wisconsin. 


Torgerson, T. L., Emeritus Professor of Education, University of Wisconsin, Madi- 
son, Wisconsin. 

Townsend, Agatha, Consultant, Educational Records Bureau, 21 Audubon Avenue, 
New York 32, New York. 

Trabue, M. R., Dean, College of Education, Pennsylvania State University, State 
College, Pennsylvania. (President of AERA, 1925-26.) 

Travers, Robert M. W., Chief, Prediction Research Branch, Personnel Research 
Laboratory, Air Force Personnel and Training Research Center, Lackland Air 
Force Base, San Antonio, Texas. 

Traxler, Arthur E., Executive Director, Educational Records Bureau, 21 Audubon 
Avenue, New York 32, New York. (President of AERA, 1950-51.) 

Treacy, John P., Director, Department of Education, Marquette University, Mil- 
waukee 3, Wisconsin. 

Triggs, Frances Oralind, Chairman, Committee on Diagnostic Reading Tests, Inc., 
Kingscote Apt. 3G, 419 West 119th Street, New York 27, New York. 

Trione, Verdun, Director of Guidance and Research—Sonoma County Superintendent 
of Schools, Santa Rosa, California. 

Trow, William Clark, Professor of Educational Psychology, School of Education, 
University of Michigan, Ann Arbor, Michigan. 

Troyer, Maurice E., Vicepresident (Curriculum and Instruction), Japan Inter- 
national Christian University, 1500 Osawa, Mitaka-shi, Tokyo, Japan. 

Truitt, W. J. B., Director of Research, Norfolk Public Schools, School Administration 
Building, Bank and Charlotte Street, Norfolk 10, Virginia. 

Tully, Marguerite, Supervisor, Psychological Department and School Clinic for 
Problem Children, Department of Public Schools, 20 Summer Street, Providence 2, 
Rhode Island. 

Turnbull, William W., Vicepresident for Testing Operations, Educational Testing 
Service, Princeton, New Jersey. 

Turney, Austin H., Professor of Education, University of Kansas, 120 Fraser Hall, 
Lawrence, Kansas. 

Tyler, Frederick T., Professor of Education, University of California, Berkeley 4, 
California. 

Tyler, Harry E., Assistant to the Director for Research and Development, United 
States Armed Forces Institute, 102 North Hamilton Street, Madison 3, Wisconsin. 

Tyler, Louise L., Teacher, Chicago Teachers College, 6800 South Stewart Avenue, 
Chicago, Illinois. 

Tyler, Ralph W., Director, Center for Advanced Study in the Behavioral Sciences, 
202 Junipero Serra Boulevard, Stanford, California. 

Umstattd, James G., Professor of Secondary Education, University of Texas, Austin, 
Texas. 

Unruh, Adolph, Associate Professor of Education, Washington University, St. Louis 
5, Missouri. 

Upshall, Charles C., Industrial Relations Department, Eastman Kodak Company, 
Rochester 4, New York. 

Urell, Catherine, Research Assistant, Bureau of Educational Research, Board of 
Education of the City of New York, 110 Livingston Street, Brooklyn 1, New York. 

Van Wagenen, M. J., Department of Educational Psychology, University of Minne- 
sota, Minneapolis, Minnesota. 

Varty, Jonathan W., Assistant Professor of Education, Brooklyn College, Bedford 
Avenue and Avenue H, Brooklyn 10, New York. 

Vaughn, Kenneth W., Partner, Rohrer, Hibler, and Replogle, 715 North Van Buren, 
Milwaukee 2, Wisconsin. 

Voges, Bernard H., Assistant Director of School Finance, Laws and Statistics, 
Missouri State Board of Education, P. O. Box 480, Jefferson City, Missouri. 

Vosk, Marc, Director of the Scientific Research Department, American Jewish Com- 
mittee, 386 Fourth Avenue, New York 16, New York. 

Votaw, D. F., Professor of Education, Southwest Texas State College, San Marcos, 
Texas. 

Vredevoe, Lawrence E., Professor of Education, University of California, Los 
Angeles 24, California. 
Walker, George H., Jr., Dean, Graduate School of Education and Director, Bureau 
of Educational Research and Service, Texas College, Tyler, Texas. 

Walker, Helen M., Professor of Education, Teachers College, Columbia University, 
New York 27, New York. (Secretary-Treasurer of AERA, 1940-43; President, 
1949-50.) 

Walling, W. Donald, Educational Consultant, Private Practice, 391 Winter Street 
Extension, Troy, New York. 

Walsh, J. Hartt, Dean, College of Education, Butler University, Indianapolis 7, 
Indiana. 

Wandt, Edwin, Assistant Professor of Education, Los Angeles State College, 855 
North Vermont Avenue, Los Angeles 29, California. 

Wann, Kenneth D., Associate Professor of Education, Teachers College, Columbia 
University, New York 27, New York. 

Warrington, Willard G., Assistant Professor, Board of Examiners, Michigan State 
College, East Lansing, Michigan. 

Washburne, Carleton W., Director, Teacher Education Program, Director, Graduate 
Division, Brooklyn College, Brooklyn 10, New York. (Vicepresident of AERA, 1925.) 

Washton, Nathan S., Assistant Professor, Department of Education, Queens College, 
Flushing, New York. 

Waterman, Ivan R., Chief, Bureau of Textbooks and Publications, California State 
Department of Education, Sacramento 14, California. 

Waters, Eugene A., Vicepresident, The University of Tennessee, Knoxville 16, 
Tennessee. 

Watkins, Ralph K., Professor of Education, University of Missouri, Columbia, 
Missouri. 

Watson, Goodwin, Professor of Education, Teachers College, Columbia University, 
New York 27, New York. 

Watson, Jack M., Professor of Music Education, School of Music, Indiana Uni- 
versity, Bloomington, Indiana. 

Weaver, Edward K., Professor of Science Education, Atlanta University, Atlanta, 
Georgia. 

Weaver, J. Fred, Associate Professor of Education, School of Education, Boston 
University, Boston 15, Massachusetts. 

Weedon, Vivian, Curriculum Consultant, National Safety Council, Chicago, Illinois. 

Weeks, Harold L., Supervisor of Research, San Bernardino City Schools, 799 F 
Street, San Bernardino, California. 

Weinrich, Ernest F., Assistant Superintendent, Public Schools, Schenectady 5, New 
York. 

Weisiger, Louise P., Director of Research, Richmond Public Schools, 407 North 12th 
Street, Richmond 19, Virginia. 

Weitz, Henry, Director, Bureau of Testing and Guidance, Duke University, Durham, 
North Carolina. 

Weitzel, Henry I., Research Director, Pasadena City Schools, 351 South Hudson 
Avenue, Pasadena 5, California. 

Wellek, A. A., Director, Counseling and Testing Services, University of New Mexico, 
Albuquerque, New Mexico. 

Wendt, Paul R., 241 Molimo Drive, San Francisco 27, California. 

Wesman, Alexander G., Associate Director, Test Division, Psychological Corporation, 
522 Fifth Avenue, New York 36, New York. 

West, Leonard J., Research Psychologist, Training Aids Research Laboratory, Air 
Force Personnel and Training Research Center, Chanute Air Force Base, Illinois. 

Westover, Frederick L., Associate Professor, College of Education, University of 
Alabama, University, Alabama. 

Weyer, Frank E., Dean, Hastings College, Hastings, Nebraska. 

Wheeler, Lester R., Director of Reading Clinic, University of Miami, Coral Gables, 
Florida. 

Whipple, Gertrude, Associate Professor of Education, College of Education, Wayne 
University, Detroit 1, Michigan. 

Whitehead, Willis A., Educational Specialist, Outcalt, Guenther and Associates, 
Architects, 13124 Shaker Square, Cleveland 20, Ohio. 

Whitesel, John A., Professor of Industrial Arts Education, Miami University, Oxford, 
Ohio. 


December 1954 LIST OF MEMBERS 
general fifteen categories. The actual titles of issues, therefore, have varied somewhat. 

Because the October 1953 issue deals with new materials or subjects that have been 
treated only incidentally in previous cycles, it is listed below as topic 16. Single copies 
prior to 1949, $1; after 1949, $1.50 each, with quantity discounts: 10% discount on 2 to 
9 copies; 25% on 10 to 99 copies; 33 1/3% on 100 or more copies. Orders should be sent 
to 1201 Sixteenth Street, N. W., Washington 6, D. C. 

1. HISTORY OF EDUCATION AND COMPARATIVE EDUCATION. VI:4 (October 1936); 
IX:4 (October 1939). 

2. SOCIAL FOUNDATIONS AND CONTRIBUTIONS OF EDUCATION. VII:1 (February 1937)*; 
X:1 (February 1940); XIII:1 (February 1943); XVI:1 (February 1946)*; 
XIX:1 (February 1949). 

3. ORGANIZATION, ADMINISTRATION, AND SUPERVISION OF EDUCATION. I:3 (June 
1931); IV:4 (October 1934); VII:4 (October 1937); X:4 (October 1940); 
XIII:4 (October 1943)*; XVI:4 (October 1946)*; XIX:4 (October 1949)*; 
XXII:4 (October 1952). 

4. LEGAL AND FISCAL ASPECTS OF EDUCATION. II:2 (April 1932); III:5 (December 
1933); V:2 (April 1935); VIII:2 (April 1938); XI:2 (April 1941); XIV:2 
(April 1944)*; XVII:2 (April 1947)*; XX:2 (April 1950). 

5. SCHOOL PLANT AND EQUIPMENT. II:5 (December 1932); V:4 (October 1935); 
VIII:4 (October 1938)*; XII:2 (April 1942)*; XV:1 (February 1945)*; 
XVIII:1 (February 1948); XXI:1 (February 1951). 

6. TEACHER PERSONNEL. I:2 (April 1931); IV:3 (June 1934); VII:3 (June 1937)*; 
X:3 (June 1940); XIII:3 (June 1943); XVI:3 (June 1946)*; XIX:3 (June 
1949); XXII:3 (June 1952). 

7. PUPIL PERSONNEL, GUIDANCE, AND COUNSELING. III:3 (June 1933); VI:2 (April 
1936)*; IX:2 (April 1939)*; XII:1 (February 1942)*; XV:2 (April 1945)*; 
XVIII:2 (April 1948)*; XXI:2 (April 1951); XXIV:1 (February 1954); 
XXIV:2 (April 1954). 

8. EDUCATIONAL AND PSYCHOLOGICAL TESTS. II:3 (June 1932); II:4 (October 1932); 
III:1 (February 1933); V:3 (June 1935); V:5 (December 1935); VIII:3 (June 
1938); VIII:5 (December 1938); XI:1 (February 1941)*; XIV:1 (February 
1944); XVII:1 (February 1947)*; XX:1 (February 1950); XXIII:1 (February 
1953). 

9. METHODS OF RESEARCH AND EXPERIMENTATION. IV:1 (February 1934); IX:5 
(December 1939)*; XII:5 (December 1942); XV:5 (December 1945); 
XVIII:5 (December 1948); XXI:5 (December 1951); XXIV:5 (December 1954). 

10. MENTAL AND PHYSICAL HEALTH, GROWTH AND DEVELOPMENT. III:2 (April 1933); 
VI:1 (February 1936); VI:5 (December 1936); IX:1 (February 1939); X:5 
(December 1940); XI:5 (December 1941)*; XIII:5 (December 1943); XIV:5 
(December 1944); XVI:5 (December 1946); XVII:5 (December 1947); XIX:5 
(December 1949); XX:5 (December 1950); XXII:5 (December 1952). 

11. GENERAL ASPECTS OF INSTRUCTION, LEARNING, TEACHING, AND THE CURRICULUM. 
I:1 (January 1931); III:4 (October 1933)*; IV:2 (April 1934); VI:3 (June 
1936); VII:2 (April 1937)*; IX:3 (June 1939); XII:3 (June 1942); XV:3 
(June 1945); XVIII:3 (June 1948); XXI:3 (June 1951); XXIII:2° (April 
1953); XXIV:3 (June 1954); XXIV:4 (October 1954). 

12. SPECIAL METHODS AND THE PSYCHOLOGY OF SPECIAL SUBJECTS. I:4 (October 1931); 
I:5 (December 1931); II:1 (February 1932); IV:5 (December 1934); V:1 
(February 1935); VII:5 (December 1937)*; VIII:1 (February 1938). 

13. SPECIAL SUBJECTMATTER FIELDS. X:2 (April 1940); XI:4 (October 1941); XII:4 
(October 1942)*; XIII:2 (April 1943)*; XV:4 (October 1945)*; XVI:2 (April 
1946); XVIII:4 (October 1948); XIX:2 (April 1949); XXII:2° (April 1952). 

14. EDUCATION OF EXCEPTIONAL CHILDREN AND MINORITY GROUPS. XI:3 (June 1941)*; 
XIV:3 (June 1944)*; XXIII:5 (December 1953). 

15. EDUCATION FOR WORK, CITIZENSHIP, AND LEISURE (Including Adult Education). 
XIV:4 (October 1944)*; XVII:3 (June 1947)*; XVII:4 (October 1947)*; 
XX:3 (June 1950)*; XX:4 (October 1950); XXIII:3 (June 1953). 

16. RESEARCH ON HUMAN RELATIONS AND PROGRAMS OF ACTION. XXIII:4 (October 
1953). 

Forthcoming Issues 

SOCIAL FRAMEWORK OF EDUCATION. February 1955. B. Othanel Smith, Chairman. 
LANGUAGE ARTS AND FINE ARTS. April 1955. David H. Russell, Chairman. 


* Out of print. 





